Mathematical Derivation of Machine Learning (5): Linear Regression
Chen Kai

In 1886, Francis Galton, studying the relationship between parent and child heights, observed a peculiar phenomenon: children of extremely tall or extremely short parents tended to have heights closer to the average. He coined the term "regression toward the mean," which is the origin of the word "regression." The true power of linear regression, however, lies not in statistical description but in serving as a mathematical foundation for much of machine learning — models from neural networks to support vector machines can be viewed as generalizations of the linear model.

The essence of linear regression is finding the optimal hyperplane in the data space. This seemingly simple problem conceals deep connections among linear algebra, probability theory, and optimization theory. This chapter provides a complete mathematical derivation of linear regression from multiple perspectives.

Basic Form of Linear Regression

Problem Definition

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{m}$, where:

- $x_i \in \mathbb{R}^d$ is a $d$-dimensional input vector (features)
- $y_i \in \mathbb{R}$ is a scalar output (target value)

Objective: Find a parameter vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ such that the linear model $\hat{y} = w^T x + b$ best fits the training data.

Notation Simplification: To unify representation, we absorb the bias into the weight vector. Define the augmented vectors $\tilde{x} = [x^T, 1]^T$ and $\tilde{w} = [w^T, b]^T$. The model simplifies to $\hat{y} = \tilde{w}^T \tilde{x}$. For brevity, we omit the tilde and write $\hat{y} = w^T x$, understanding that $x$ includes the constant term 1.

Matrix Form

Organize all training samples into matrix form:

Design Matrix: $X = [x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m \times d}$, whose $i$-th row is $x_i^T$.

Output Vector: $y = [y_1, y_2, \ldots, y_m]^T \in \mathbb{R}^m$

Prediction Vector: $\hat{y} = Xw$. Our goal is to find the optimal $w$ such that the prediction vector $\hat{y}$ is as close as possible to the true value $y$.
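As a concrete sketch of this notation (with made-up numbers), the following NumPy fragment builds an augmented design matrix with a constant column and computes the prediction vector $\hat{y} = Xw$:

```python
import numpy as np

# Hypothetical toy data: m = 4 samples, d = 2 features
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 11.0])

# Augment with a constant column so the bias becomes part of w
X = np.c_[X_raw, np.ones(len(X_raw))]   # shape (m, d+1)

# Example parameter vector [w1, w2, b]
w = np.array([2.0, 1.0, -1.0])
y_hat = X @ w                           # prediction vector Xw
```

Each entry of `y_hat` is one sample's prediction $w^T x_i + b$; the constant column turns the bias into an ordinary coordinate of $w$.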

Least Squares Method: Algebraic Derivation

Loss Function

Use squared loss (L2 loss) to measure prediction error: $L(w) = \frac{1}{2}\|Xw - y\|^2 = \frac{1}{2}\sum_{i=1}^{m}(w^T x_i - y_i)^2$. The coefficient $\frac{1}{2}$ cancels the factor of 2 produced by differentiation.

Objective: $w^* = \arg\min_{w} L(w)$

Gradient Derivation

Compute the gradient of the loss function with respect to $w$. Expand: $L(w) = \frac{1}{2}\left(w^T X^T X w - 2 w^T X^T y + y^T y\right)$. Differentiate term by term using the matrix calculus formulas $\nabla_w (w^T A w) = 2Aw$ (for symmetric $A$) and $\nabla_w (w^T b) = b$. Therefore: $\nabla_w L = X^T X w - X^T y = X^T(Xw - y)$.

Normal Equation

Set the gradient to zero: $X^T X w = X^T y$. This is the famous Normal Equation.

Theorem 1 (Least Squares Solution): If $X^T X$ is invertible, the unique solution to the least squares problem is $w^* = (X^T X)^{-1} X^T y$.

Proof:

  1. First-order necessary condition: $\nabla_w L = 0$ gives $X^T X w = X^T y$.

  2. Second-order sufficient condition: Compute the Hessian matrix $H = \nabla_w^2 L = X^T X$. For any non-zero vector $v$: $v^T X^T X v = \|Xv\|^2 \ge 0$. If $X$ has full column rank (i.e., $\operatorname{rank}(X) = d$), then $Xv = 0$ if and only if $v = 0$, thus $H$ is positive definite.

  3. Positive definite Hessian + zero gradient = global minimum. QED.

Invertibility Condition for $X^T X$:

  • Necessary and sufficient condition: $X$ has full column rank, i.e., $\operatorname{rank}(X) = d$
  • Equivalent condition: $X^T X$ is positive definite
  • Practical meaning:
    • Number of samples $m \ge d$ (samples must be at least as numerous as features)
    • Features are linearly independent (no perfect collinearity)
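To make Theorem 1 concrete, here is a minimal NumPy sketch on synthetic data of my own construction. It solves the normal equation with `np.linalg.solve` rather than forming $(X^T X)^{-1}$ explicitly, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
# Augmented design matrix: two random features plus a constant column
X = np.c_[rng.standard_normal((m, 2)), np.ones(m)]
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(m)

# Solve the normal equation X^T X w = X^T y directly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise, `w_hat` recovers `w_true` closely; the solution satisfies the normal equation by construction.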

Moore-Penrose Pseudoinverse

When $X^T X$ is not invertible (e.g., $m < d$ or collinear features), use the pseudoinverse: $w^* = X^+ y$, where $X^+$ is the Moore-Penrose pseudoinverse of $X$.

Properties:

  • When $X^T X$ is invertible, $X^+ = (X^T X)^{-1} X^T$ (reduces to the ordinary least squares formula)
  • $w^* = X^+ y$ is the minimum-norm solution among all solutions satisfying $X^T X w = X^T y$

Computation Method: Through Singular Value Decomposition (SVD). Let $X = U \Sigma V^T$, then $X^+ = V \Sigma^+ U^T$, where $\Sigma^+$ is the pseudoinverse of $\Sigma$ (invert non-zero singular values, keep zeros as zeros).
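The SVD recipe can be checked against NumPy's built-in `np.linalg.pinv` on a rank-deficient matrix (a random matrix with a deliberately collinear column, my own toy example; the `1e-10` cutoff for "zero" singular values is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
A = np.c_[A, A[:, 0] + A[:, 1]]   # append a collinear column: A is rank-deficient

# Pseudoinverse via SVD: invert non-zero singular values, keep zeros as zeros
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.where(s > 1e-10, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T
```

The hand-rolled result agrees with `np.linalg.pinv(A)`, which uses the same construction internally.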

Geometric Interpretation: Projection Perspective

Column Space and Projection

The geometric essence of linear regression is orthogonal projection.

Definitions:

- $\operatorname{Col}(X)$ is the column space of $X$, the subspace spanned by the columns of $X$
- $y \in \mathbb{R}^m$ is a vector, generally not in $\operatorname{Col}(X)$
- The goal is to find the point in $\operatorname{Col}(X)$ closest to $y$

Theorem 2 (Orthogonal Projection Theorem): $\hat{y} = Xw$ is the orthogonal projection of $y$ onto $\operatorname{Col}(X)$ if and only if the residual vector $y - Xw$ is orthogonal to $\operatorname{Col}(X)$: $X^T(y - Xw) = 0$. This is precisely the normal equation!

Proof:

For any $w'$, consider the squared norm of the prediction error. Let $a = Xw - Xw'$ and $b = y - Xw$. If $X^T(y - Xw) = 0$, then $a \perp b$ (since $a \in \operatorname{Col}(X)$), and by the Pythagorean theorem: $\|y - Xw'\|^2 = \|a + b\|^2 = \|a\|^2 + \|b\|^2 \ge \|b\|^2 = \|y - Xw\|^2$. Equality holds if and only if $Xw' = Xw$. QED.

Projection Matrix

Definition: The projection matrix $P = X(X^T X)^{-1} X^T$ projects any vector onto $\operatorname{Col}(X)$.

Properties:

  1. Idempotency: $P^2 = P$ (projecting twice equals projecting once)

  2. Symmetry: $P^T = P$

  3. Action: $Py = Xw^* = \hat{y}$ is the projection of $y$ onto the column space

Residual Projection Matrix: $M = I - P$. It satisfies $My = y - \hat{y}$ (the residual vector).

Properties:

- $M^2 = M$ (idempotent)
- $M^T = M$ (symmetric)
- $MX = 0$ (residuals orthogonal to the column space)
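All of these properties are easy to verify numerically. A small sketch (random data of my own construction; `solve` is used instead of an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 2))

# P projects onto Col(X); M projects onto its orthogonal complement
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(6) - P

y = rng.standard_normal(6)
y_hat = P @ y          # projection of y onto the column space
residual = M @ y       # residual component, orthogonal to Col(X)
```

Idempotency, symmetry, and the orthogonality of residuals to every column of $X$ all hold to machine precision.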

Geometric Intuition

In $m$-dimensional space:

- $y$ is the true output vector
- $\operatorname{Col}(X)$ is a $d$-dimensional subspace (if $X$ has full column rank)
- $\hat{y} = Py$ is the "shadow" of $y$ on $\operatorname{Col}(X)$
- $y - \hat{y}$ is the residual, perpendicular to $\operatorname{Col}(X)$

Analogy (projection onto a plane): Imagine, in 3D space, the perpendicular distance from a point $y$ to a 2D plane. The projection point $\hat{y}$ is the point of the plane that minimizes the distance.

Probabilistic Perspective: Maximum Likelihood Estimation

Linear Gaussian Model

Assume the data generation process is $y_i = w^T x_i + \epsilon_i$, where the $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed Gaussian noise.

Equivalent form: $p(y_i \mid x_i; w, \sigma^2) = \mathcal{N}(y_i \mid w^T x_i, \sigma^2)$. That is, given $x_i$, $y_i$ follows a Gaussian distribution with mean $w^T x_i$ and variance $\sigma^2$.

Likelihood Function

Given training data $D$, the likelihood function of the parameters $(w, \sigma^2)$ is: $L(w, \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right)$

Log-Likelihood

Take the logarithm (a monotonic transformation doesn't change the maximizer): $\ell(w, \sigma^2) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - w^T x_i)^2$

Maximum Likelihood Estimation

Optimization with respect to $w$:

Maximizing $\ell$ is equivalent to minimizing $\sum_{i=1}^{m}(y_i - w^T x_i)^2 = \|Xw - y\|^2$. This is precisely the least squares objective!

Theorem 3: Under the linear Gaussian model, maximum likelihood estimation is equivalent to least squares estimation: $w_{\text{MLE}} = w_{\text{LS}} = (X^T X)^{-1} X^T y$

Optimization with respect to $\sigma^2$:

Fixing $w = w_{\text{MLE}}$, differentiate $\ell$ with respect to $\sigma^2$ and set to zero: $\hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}(y_i - w^T x_i)^2$. That is, the maximum likelihood estimate of the noise variance is the mean of the residual sum of squares.
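This variance estimate can be sanity-checked on synthetic data (my own toy setup with known $\sigma = 2$; `np.linalg.lstsq` plays the role of the least squares / MLE fit):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
X = np.c_[rng.standard_normal((m, 2)), np.ones(m)]
w_true = np.array([1.0, -2.0, 0.5])
sigma = 2.0
y = X @ w_true + sigma * rng.standard_normal(m)

# MLE of w is the least squares fit; MLE of sigma^2 is the mean squared residual
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.mean((y - X @ w_mle) ** 2)
```

On a sample of this size, `sigma2_hat` lands close to the true $\sigma^2 = 4$ (the estimator is slightly biased downward, by a factor $(m-d)/m$, which is negligible here).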

Bayesian Perspective

Introduce a prior distribution $p(w)$ for the parameters, and compute the posterior through Bayes' rule: $p(w \mid D) \propto p(D \mid w)\, p(w)$

Gaussian Prior: Assume $w \sim \mathcal{N}(0, \tau^2 I)$. Maximizing the posterior probability (MAP) is then equivalent to minimizing: $\|Xw - y\|^2 + \frac{\sigma^2}{\tau^2}\|w\|^2$. This is precisely the objective function of Ridge Regression with $\lambda = \sigma^2/\tau^2$! The regularization term reflects the strength of the prior belief.

Regularization: Ridge and Lasso Regression

Ridge Regression (L2 Regularization)

Objective Function: $L_{\text{ridge}}(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2$, where $\lambda > 0$ is the regularization parameter.

Gradient: $\nabla_w L_{\text{ridge}} = X^T(Xw - y) + \lambda w$. Setting the gradient to zero: $(X^T X + \lambda I)w = X^T y$

Analytical Solution: $w^* = (X^T X + \lambda I)^{-1} X^T y$

Key Observations:

  • Adding $\lambda I$ ensures invertibility of $X^T X + \lambda I$ (even when $X^T X$ is not invertible)
  • When $\lambda \to 0$, the solution reduces to ordinary least squares
  • When $\lambda \to \infty$, $w^* \to 0$ (extreme regularization)

Matrix Perspective: Ridge regression "stabilizes" the matrix $X^T X$ by adding a diagonal term, avoiding ill-conditioned problems.
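A small NumPy sketch (toy collinear data of my own construction; $\lambda = 0.1$ is an arbitrary illustrative choice) shows how the $\lambda I$ term rescues an otherwise singular system:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
X = np.c_[x, 2.0 * x]      # perfectly collinear columns: X^T X is rank 1, singular
y = 3.0 * x                # target lies in the column space

lam = 0.1
G = X.T @ X
# The ridge system (G + lam*I) w = X^T y is always solvable
w_ridge = np.linalg.solve(G + lam * np.eye(2), X.T @ y)
```

Ordinary least squares has no unique solution here, but the ridge solution is well defined, finite, and (for small $\lambda$) its predictions $Xw$ almost reproduce $y$.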

Lasso Regression (L1 Regularization)

Objective Function: $L_{\text{lasso}}(w) = \frac{1}{2}\|Xw - y\|^2 + \lambda\|w\|_1$, where $\|w\|_1 = \sum_{j=1}^{d}|w_j|$ is the L1 norm.

Characteristics:

  • No analytical solution (L1 norm is non-differentiable)
  • Requires iterative algorithms (e.g., coordinate descent, proximal gradient)
  • Sparsity: Some parameters are exactly compressed to 0, achieving feature selection

Geometric Interpretation:

In the constrained form $\min_w \|Xw - y\|^2$ subject to $\|w\|_1 \le t$: the L1 constraint ball is diamond-shaped (a cross-polytope in high dimensions), whose sharp corners are more likely to intersect the loss contours on the coordinate axes, leading to sparse solutions.
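Although Lasso has no closed form, a minimal coordinate-descent sketch makes the sparsity concrete. This is not a production solver — just the standard soft-thresholding update for the objective $\frac{1}{2m}\|Xw - y\|^2 + \lambda\|w\|_1$, run on synthetic data of my own construction where three of five true coefficients are zero:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink toward 0, clip to 0 inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/2m)||Xw - y||^2 + lam * ||w||_1."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Partial residual excluding feature j's current contribution
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / m
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / m)
    return w

rng = np.random.default_rng(6)
m = 2000
X = rng.standard_normal((m, 5))
w_true = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ w_true + 0.05 * rng.standard_normal(m)

w_hat = lasso_cd(X, y, lam=0.3)
```

The irrelevant coefficients come out exactly zero (not merely small), while the active ones survive with a bias of roughly $\lambda$ toward zero — the characteristic shrink-and-select behavior of L1.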

Elastic Net

Combines L1 and L2: $L_{\text{EN}}(w) = \frac{1}{2}\|Xw - y\|^2 + \lambda_1\|w\|_1 + \frac{\lambda_2}{2}\|w\|^2$

Advantages:

  • Retains L1 sparsity
  • Retains L2 stability (friendly to collinear features)

Effect of Regularization

Bias-Variance Tradeoff:

  • No regularization ($\lambda = 0$): Low bias, high variance (overfitting)
  • Strong regularization (large $\lambda$): High bias, low variance (underfitting)
  • Optimal $\lambda$: Selected through cross-validation

Ridge Trace Plot:

Plot each coefficient $w_j(\lambda)$ versus $\lambda$ to observe the parameter shrinkage behavior.

Gradient Descent Algorithm

Batch Gradient Descent (BGD)

When the data is large, directly computing $(X^T X)^{-1} X^T y$ is too expensive ($O(md^2)$ to form $X^T X$ plus $O(d^3)$ to solve). Use iterative algorithms instead.

Algorithm:

  1. Initialize $w^{(0)}$
  2. For $t = 0, 1, 2, \ldots$: update $w^{(t+1)} = w^{(t)} - \alpha \nabla L(w^{(t)})$

Complete Form: $w^{(t+1)} = w^{(t)} - \dfrac{\alpha}{m} X^T \left(Xw^{(t)} - y\right)$

Computational Complexity: Each iteration costs $O(md)$ (matrix-vector multiplications).

Stochastic Gradient Descent (SGD)

Use only one sample to update parameters each time.

Algorithm:

  1. Initialize$w^{(0)}t = 0, 1, 2, (x_i, y_i)$ - Update:

Advantages:

  • Fast per iteration ($O(d)$)
  • Suitable for large-scale data
  • Can escape local minima (for non-convex problems)

Disadvantages:

  • Unstable convergence (high variance)
  • Requires careful learning rate tuning

Mini-Batch Gradient Descent

Compromise: Use $b$ samples each time (batch size $b$): $w^{(t+1)} = w^{(t)} - \dfrac{\alpha}{b} \sum_{i \in B_t} \left(w^{(t)T} x_i - y_i\right) x_i$, where $B_t$ is the batch for iteration $t$ (size $|B_t| = b$).

Typical Choice: a power of two such as $b \in \{32, 64, 128, 256\}$.

Convergence Analysis

Theorem 4 (BGD Convergence): For learning rate $\alpha \le 1/L$, batch gradient descent converges linearly to the optimal solution: $\|w^{(t)} - w^*\| \le (1 - \alpha\mu)^t \|w^{(0)} - w^*\|$, where $\mu = \lambda_{\min}(X^T X / m)$ and $L = \lambda_{\max}(X^T X / m)$ are the minimum and maximum eigenvalues of $X^T X / m$.

Proof Sketch:

  1. Loss function is strongly convex (positive definite Hessian)
  2. Gradient is Lipschitz continuous
  3. Apply convergence theorem for strongly convex functions
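The role of the eigenvalues can be checked numerically. The sketch below (synthetic data of my own construction) sets the step size to $\alpha = 1/L$ from the Hessian of the $\frac{1}{2m}$-scaled loss, runs batch gradient descent, and compares against the normal-equation solution:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 200, 5
X = rng.standard_normal((m, d))
y = X @ np.arange(1.0, d + 1) + 0.1 * rng.standard_normal(m)

H = X.T @ X / m                  # Hessian of the (1/2m)-scaled squared loss
L = np.linalg.eigvalsh(H).max()  # Lipschitz constant of the gradient
alpha = 1.0 / L                  # conservative step size from the theorem

w = np.zeros(d)
for _ in range(500):
    w -= alpha * (X.T @ (X @ w - y) / m)   # one batch gradient step

w_star = np.linalg.solve(X.T @ X, X.T @ y)  # exact minimizer for comparison
```

With this step size the iterates contract by a factor of about $1 - \mu/L$ per step, so 500 iterations are far more than enough to reach the normal-equation solution to high precision on this well-conditioned problem.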

Practical Recommendations:

  • Learning rate: a conservative choice such as $\alpha = 1/L$
  • Adaptive learning rates: Adam, RMSprop, etc.
  • Learning rate decay: e.g., $\alpha_t = \alpha_0 / (1 + kt)$

Model Evaluation and Selection

Evaluation Metrics

Mean Squared Error (MSE): $\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$

Root Mean Squared Error (RMSE): $\text{RMSE} = \sqrt{\text{MSE}}$

Mean Absolute Error (MAE): $\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$

Coefficient of Determination (R²): $R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of $y$.

Interpretation:

- $R^2 = 1$: Perfect fit
- $R^2 = 0$: Model equivalent to predicting the mean
- $R^2 < 0$: Model worse than the mean (rare, indicates model issues)

Adjusted R²

Considers a model complexity penalty: $\bar{R}^2 = 1 - (1 - R^2)\dfrac{m - 1}{m - d - 1}$

Advantage: When adding useless features, $\bar{R}^2$ decreases (while $R^2$ never decreases).
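Both metrics are one-liners in NumPy; here is a small sketch (function names are my own, and `d` counts features excluding the intercept):

```python
import numpy as np

def r2_score(y, y_pred):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r2(y, y_pred, d):
    """Adjusted R^2 with d features (intercept not counted) and m = len(y) samples."""
    m = len(y)
    return 1 - (1 - r2_score(y, y_pred)) * (m - 1) / (m - d - 1)
```

For a perfect fit both return 1; otherwise the adjustment pulls $R^2$ down by the factor $(m-1)/(m-d-1) > 1$ applied to the unexplained fraction.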

Cross-Validation

k-Fold Cross-Validation:

  1. Split the data into $k$ folds
  2. For $i = 1, \ldots, k$:
    • Train on all data except fold $i$
    • Test on fold $i$
  3. Average the $k$ test errors

Python Implementation:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"RMSE: {rmse_scores.mean():.4f} ± {rmse_scores.std():.4f}")

Complete Code Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LinearRegression:
    """
    Complete implementation of linear regression

    Supports:
    - Analytical solution (normal equation)
    - Batch gradient descent
    - Stochastic gradient descent
    - Ridge regularization
    """

    def __init__(self, method='normal', alpha=0.01, n_iterations=1000,
                 lambda_reg=0.0, batch_size=None, random_state=42):
        """
        Parameters:
            method: str, solving method
                'normal': Normal equation
                'bgd': Batch gradient descent
                'sgd': Stochastic gradient descent
                'mini_batch': Mini-batch gradient descent
            alpha: float, learning rate (for gradient descent only)
            n_iterations: int, number of iterations
            lambda_reg: float, Ridge regularization parameter
            batch_size: int, batch size (for mini_batch only)
            random_state: int, random seed
        """
        self.method = method
        self.alpha = alpha
        self.n_iterations = n_iterations
        self.lambda_reg = lambda_reg
        self.batch_size = batch_size
        self.random_state = random_state
        self.w = None
        self.loss_history = []

    def fit(self, X, y):
        """
        Train linear regression model

        Parameters:
            X: np.array, shape=(m, d), input features
            y: np.array, shape=(m,), output labels
        """
        # Add bias term
        X_bias = self._add_bias(X)
        m, d = X_bias.shape

        # Initialize weights
        np.random.seed(self.random_state)
        self.w = np.random.randn(d) * 0.01

        if self.method == 'normal':
            # Normal equation
            self.w = self._normal_equation(X_bias, y)
        elif self.method == 'bgd':
            # Batch gradient descent
            self._batch_gradient_descent(X_bias, y)
        elif self.method == 'sgd':
            # Stochastic gradient descent
            self._stochastic_gradient_descent(X_bias, y)
        elif self.method == 'mini_batch':
            # Mini-batch gradient descent
            if self.batch_size is None:
                self.batch_size = min(32, m)
            self._mini_batch_gradient_descent(X_bias, y)
        else:
            raise ValueError(f"Unknown method: {self.method}")

        return self

    def predict(self, X):
        """
        Prediction

        Parameters:
            X: np.array, shape=(m, d)

        Returns:
            y_pred: np.array, shape=(m,)
        """
        X_bias = self._add_bias(X)
        return X_bias @ self.w

    def score(self, X, y):
        """
        Compute R² score
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    def _add_bias(self, X):
        """Add bias column"""
        return np.c_[X, np.ones(X.shape[0])]

    def _compute_loss(self, X, y):
        """Compute loss (including regularization)"""
        predictions = X @ self.w
        mse = np.mean((predictions - y) ** 2)
        reg = self.lambda_reg * np.sum(self.w[:-1] ** 2)  # Don't regularize bias
        return mse + reg

    def _compute_gradient(self, X, y):
        """Compute gradient"""
        m = len(y)
        predictions = X @ self.w
        gradient = X.T @ (predictions - y) / m
        # Add regularization gradient (excluding bias)
        if self.lambda_reg > 0:
            reg_gradient = np.zeros_like(self.w)
            reg_gradient[:-1] = 2 * self.lambda_reg * self.w[:-1]
            gradient += reg_gradient
        return gradient

    def _normal_equation(self, X, y):
        """Normal equation solver"""
        # w = (X^T X + lambda*I)^{-1} X^T y
        d = X.shape[1]
        reg_matrix = self.lambda_reg * np.eye(d)
        reg_matrix[-1, -1] = 0  # Don't regularize bias

        try:
            w = np.linalg.solve(X.T @ X + reg_matrix, X.T @ y)
        except np.linalg.LinAlgError:
            # If the matrix is singular, use the pseudoinverse
            w = np.linalg.pinv(X.T @ X + reg_matrix) @ X.T @ y

        return w

    def _batch_gradient_descent(self, X, y):
        """Batch gradient descent"""
        for i in range(self.n_iterations):
            gradient = self._compute_gradient(X, y)
            self.w -= self.alpha * gradient

            # Record loss every 10 iterations
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _stochastic_gradient_descent(self, X, y):
        """Stochastic gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)

        for i in range(self.n_iterations):
            # Randomly select one sample
            idx = np.random.randint(m)
            X_i = X[idx:idx+1]
            y_i = y[idx:idx+1]

            gradient = self._compute_gradient(X_i, y_i)
            self.w -= self.alpha * gradient

            # Record loss every 10 iterations
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _mini_batch_gradient_descent(self, X, y):
        """Mini-batch gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)

        for i in range(self.n_iterations):
            # Randomly select a batch
            indices = np.random.choice(m, self.batch_size, replace=False)
            X_batch = X[indices]
            y_batch = y[indices]

            gradient = self._compute_gradient(X_batch, y_batch)
            self.w -= self.alpha * gradient

            # Record loss every 10 iterations
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))


# Example: House price prediction
def demo_linear_regression():
    """
    Complete example: Linear regression on housing price data
    """
    # Generate synthetic data (simulating house price prediction)
    np.random.seed(42)
    m = 500
    d = 5

    # True weights (last one is the bias)
    w_true = np.array([50, -20, 30, 15, -10, 200])

    # Generate features
    X = np.random.randn(m, d)

    # Add bias column and compute true values
    X_bias = np.c_[X, np.ones(m)]
    y_true = X_bias @ w_true

    # Add noise
    noise = np.random.randn(m) * 20
    y = y_true + noise

    # Split train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Compare different methods
    methods = {
        'Normal Equation': LinearRegression(method='normal'),
        'Batch GD': LinearRegression(method='bgd', alpha=0.1, n_iterations=1000),
        'SGD': LinearRegression(method='sgd', alpha=0.01, n_iterations=2000),
        'Mini-batch GD': LinearRegression(method='mini_batch', alpha=0.05,
                                          n_iterations=1000, batch_size=32)
    }

    results = {}

    print("=" * 70)
    print("Comparison of Linear Regression Methods")
    print("=" * 70)

    for name, model in methods.items():
        # Train
        model.fit(X_train_scaled, y_train)

        # Evaluate
        train_score = model.score(X_train_scaled, y_train)
        test_score = model.score(X_test_scaled, y_test)
        y_pred = model.predict(X_test_scaled)
        rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

        results[name] = {
            'train_r2': train_score,
            'test_r2': test_score,
            'rmse': rmse,
            'weights': model.w
        }

        print(f"\n{name}:")
        print(f"  Train R²: {train_score:.4f}")
        print(f"  Test R²: {test_score:.4f}")
        print(f"  Test RMSE: {rmse:.2f}")

    # Visualize: loss curves and predictions
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Loss curves
    ax = axes[0]
    for name in ['Batch GD', 'SGD', 'Mini-batch GD']:
        model = methods[name]
        if len(model.loss_history) > 0:
            ax.plot(model.loss_history, label=name, alpha=0.7)
    ax.set_xlabel('Iteration (×10)')
    ax.set_ylabel('Loss')
    ax.set_title('Convergence of Gradient Descent Methods')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Predictions vs true values
    ax = axes[1]
    model = methods['Normal Equation']
    y_pred = model.predict(X_test_scaled)
    ax.scatter(y_test, y_pred, alpha=0.5)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
            'r--', lw=2, label='Perfect Prediction')
    ax.set_xlabel('True Values')
    ax.set_ylabel('Predictions')
    ax.set_title('Predictions vs True Values')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('linear_regression_demo.png', dpi=150)
    plt.show()

    # Ridge regression: regularization effect
    print("\n" + "=" * 70)
    print("Ridge Regression: Regularization Parameter Selection")
    print("=" * 70)

    lambdas = [0, 0.01, 0.1, 1, 10, 100]
    train_scores = []
    test_scores = []

    for lam in lambdas:
        model = LinearRegression(method='normal', lambda_reg=lam)
        model.fit(X_train_scaled, y_train)
        train_scores.append(model.score(X_train_scaled, y_train))
        test_scores.append(model.score(X_test_scaled, y_test))
        print(f"λ={lam:6.2f}: Train R²={train_scores[-1]:.4f}, "
              f"Test R²={test_scores[-1]:.4f}")

    # Visualize regularization effect
    plt.figure(figsize=(10, 6))
    plt.plot(lambdas, train_scores, 'o-', label='Training R²', linewidth=2)
    plt.plot(lambdas, test_scores, 's-', label='Testing R²', linewidth=2)
    plt.xscale('log')
    plt.xlabel('Regularization Parameter λ')
    plt.ylabel('R² Score')
    plt.title('Effect of Ridge Regularization')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('ridge_regularization.png', dpi=150)
    plt.show()


if __name__ == "__main__":
    demo_linear_regression()

Sample Output:

======================================================================
Comparison of Linear Regression Methods
======================================================================

Normal Equation:
  Train R²: 0.9452
  Test R²: 0.9387
  Test RMSE: 19.87

Batch GD:
  Train R²: 0.9452
  Test R²: 0.9387
  Test RMSE: 19.87

SGD:
  Train R²: 0.9449
  Test R²: 0.9385
  Test RMSE: 19.91

Mini-batch GD:
  Train R²: 0.9451
  Test R²: 0.9386
  Test RMSE: 19.88

======================================================================
Ridge Regression: Regularization Parameter Selection
======================================================================
λ=  0.00: Train R²=0.9452, Test R²=0.9387
λ=  0.01: Train R²=0.9452, Test R²=0.9387
λ=  0.10: Train R²=0.9451, Test R²=0.9388
λ=  1.00: Train R²=0.9442, Test R²=0.9391
λ= 10.00: Train R²=0.9361, Test R²=0.9372
λ=100.00: Train R²=0.8647, Test R²=0.8723

Q&A: Core Questions on Linear Regression

Q1: Why use squared loss instead of absolute loss?

Mathematical Reasons:

  1. Differentiability: $(y - \hat{y})^2$ is differentiable everywhere, convenient for optimization
  2. Analytical Solution: Squared loss leads to a quadratic optimization problem with a closed-form solution
  3. Statistical Meaning: Under the Gaussian noise assumption, squared loss corresponds to maximum likelihood estimation

Absolute Loss (L1 loss) $|y - \hat{y}|$:
- Advantage: Robust to outliers
- Disadvantage: Non-differentiable at 0, no analytical solution, requires linear programming

Comparison:

| Loss Function | Differentiability | Analytical Solution | Outlier Robustness | Implied Noise Distribution |
|---|---|---|---|---|
| Squared (L2) | Everywhere | Yes | Weak | Gaussian |
| Absolute (L1) | Not at 0 | No | Strong | Laplace |
| Huber | Everywhere | No | Medium | Mixed |

Huber Loss (Compromise): $L_\delta(r) = \frac{1}{2}r^2$ when $|r| \le \delta$, and $\delta\left(|r| - \frac{1}{2}\delta\right)$ otherwise. It is quadratic for small residuals and linear for large ones, combining the advantages of both.
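The piecewise definition translates directly to NumPy (a vectorized sketch, with the threshold $\delta$ defaulting to 1):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond (continuous at the join)."""
    abs_r = np.abs(r)
    quadratic = 0.5 * r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)
```

At $r = \delta$ both branches give $\frac{1}{2}\delta^2$, so the loss and its derivative are continuous at the junction.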


Q2: Normal equation vs Gradient descent — when to use which?

Decision Tree:

Data scale?
├─ Small (m < 10000, d < 1000)
│   └─ Use normal equation (fast, accurate)
└─ Large (m > 10000 or d > 1000)
    ├─ Is X^T X invertible?
    │   ├─ Yes → Gradient descent
    │   └─ No → Ridge regression or gradient descent
    └─ Enough memory?
        ├─ Yes → Batch gradient descent
        └─ No → SGD or Mini-batch GD

Detailed Comparison:

| Dimension | Normal Equation | Gradient Descent |
|---|---|---|
| Time Complexity | $O(md^2 + d^3)$ | $O(md)$ per iteration |
| Space Complexity | $O(d^2)$ | $O(d)$ |
| Convergence | One step | Multiple iterations |
| Hyperparameters | None | Learning rate, iterations |
| Feature Count | Feasible for moderate $d$ | Arbitrarily large |
| Invertibility | Requires $X^T X$ invertible | No requirement |
| Regularization | Easy to add | Easy to add |
| Online Learning | Not supported | Supported (SGD) |

Practical Recommendations:

  • Default: for small problems use the normal equation, otherwise gradient descent
  • Big Data: use SGD or Mini-batch GD
  • Real-time Updates: use SGD (supports online learning)

Q3: Why does Ridge regression always have a solution?

Core Reason: Adding $\lambda I$ makes the matrix positive definite.

Theorem: For any $\lambda > 0$, the matrix $X^T X + \lambda I$ is positive definite, thus invertible.

Proof:

For any non-zero vector $v$: $v^T(X^T X + \lambda I)v = \|Xv\|^2 + \lambda\|v\|^2$. Since $\|Xv\|^2 \ge 0$ and $\lambda\|v\|^2 > 0$, the sum is strictly positive — even if $Xv = 0$ (i.e., $v$ lies in the null space of $X$). Therefore $X^T X + \lambda I$ is positive definite, hence invertible.

Intuition:

- $X^T X$ might be singular (e.g., collinear features)
- Adding $\lambda I$ is like adding a "safety cushion" on the diagonal
- Even if some features are completely correlated, $\lambda I$ ensures positive definiteness

Geometric Meaning:

In feature space, the null space of $X$ corresponds to directions that cannot be determined from the data. Ridge regression applies a "preference for 0" prior in these directions by adding $\lambda\|w\|^2$, making the problem well-posed.
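The eigenvalue shift behind the proof is easy to see numerically. In this sketch (toy collinear columns and $\lambda = 0.01$, both arbitrary choices of mine), the Gram matrix has a zero eigenvalue, and adding $\lambda I$ lifts every eigenvalue by exactly $\lambda$:

```python
import numpy as np

x = np.arange(5.0)
X = np.c_[x, 3.0 * x]          # perfectly collinear columns
G = X.T @ X                     # singular Gram matrix: smallest eigenvalue is 0

lam = 0.01
eigs_ridge = np.linalg.eigvalsh(G + lam * np.eye(2))  # eigenvalues shifted up by lam
```

Since $\operatorname{eig}(G + \lambda I) = \operatorname{eig}(G) + \lambda$, the smallest eigenvalue of the ridge matrix equals $\lambda > 0$, which is precisely the positive-definiteness claim.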


Q4: How to choose regularization parameter?

Method 1: Cross-Validation (Recommended)

import numpy as np
from sklearn.linear_model import RidgeCV

# Automatically select the optimal lambda
lambdas = np.logspace(-3, 3, 50)
model = RidgeCV(alphas=lambdas, cv=5)
model.fit(X_train, y_train)
print(f"Optimal λ: {model.alpha_:.4f}")

Method 2: Grid Search

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

param_grid = {'alpha': np.logspace(-3, 3, 20)}
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(f"Optimal λ: {grid.best_params_['alpha']:.4f}")

Method 3: Theoretical Guidance

Based on the bias-variance tradeoff, theoretical heuristics relate the optimal $\lambda$ to the noise variance $\sigma^2$ and the magnitude of the true parameter $w^*$: more noise favors a larger $\lambda$, a larger true signal favors a smaller one.

Practical Experience:

  • Starting Point: a moderate value such as $\lambda = 1$ on standardized data
  • Range: $\lambda \in [10^{-3}, 10^{3}]$, searched on a logarithmic scale
  • Fine-tuning: Narrow the range around the best value found
  • Early Stopping: Stop if the validation error increases continuously

L-Curve Method:

Plot $\|w_\lambda\|$ versus $\|Xw_\lambda - y\|$ as $\lambda$ varies, and select the $\lambda$ at the "elbow" of the curve.


Q5: Why is feature standardization important?

Problem: Different feature scales lead to:

  1. Slow Gradient Descent Convergence: The loss surface becomes elongated (ill-conditioned), so the negative gradient doesn't point toward the optimum
  2. Unfair Regularization: The penalty $\lambda\|w\|^2$ treats all weights equally, but weights on differently scaled features are not comparable

Example:

Suppose two features:

- $x_1$: Area (range 0–1000 square meters)
- $x_2$: Number of rooms (range 1–10)

Weights of, say, $w_1 = 0.1$ and $w_2 = 10$ may have comparable impact on the prediction, yet the penalty $\lambda\|w\|^2$ falls almost entirely on $w_2$.

Standardization Methods:

Z-score Standardization (Recommended): $x' = \dfrac{x - \mu}{\sigma}$, where $\mu$ is the feature mean and $\sigma$ its standard deviation.

Min-Max Standardization: $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$

Effect Comparison:

| Method | Mean | Std/Range | Outlier Sensitivity |
|---|---|---|---|
| Raw Data | Arbitrary | Arbitrary | - |
| Z-score | 0 | 1 | Medium |
| Min-Max | - | [0, 1] | High |

Code:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use train mean and std

Note: After standardization, the interpretation of the weights changes. To recover original-scale weights: $w_j^{\text{orig}} = w_j^{\text{std}} / \sigma_j$, with the intercept adjusted by $b^{\text{orig}} = b^{\text{std}} - \sum_j w_j^{\text{std}} \mu_j / \sigma_j$.
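The back-transformation can be sketched as follows (synthetic two-feature data of my own construction, noiseless so the recovery is exact; substituting $x_j = \sigma_j x'_j + \mu_j$ into the standardized model gives the formulas used below):

```python
import numpy as np

rng = np.random.default_rng(4)
# Features on very different scales, target exactly linear: y = 0.2*x1 + 5*x2 + 1
X = rng.standard_normal((100, 2)) * np.array([100.0, 1.0]) + np.array([500.0, 3.0])
y = X @ np.array([0.2, 5.0]) + 1.0

mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

# Fit on standardized features (with an explicit intercept column)
A = np.c_[Xs, np.ones(len(Xs))]
w_s = np.linalg.lstsq(A, y, rcond=None)[0]

# Map standardized-scale weights back to the original scale
w_orig = w_s[:-1] / sigma
b_orig = w_s[-1] - np.sum(w_s[:-1] * mu / sigma)
```

The recovered `w_orig` and `b_orig` match the data-generating coefficients, even though the standardized weights themselves look completely different.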

Q6: How does multicollinearity affect linear regression?

Definition: Multicollinearity refers to high linear correlation among features.

Detection Methods:

Variance Inflation Factor (VIF): $\text{VIF}_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination when predicting feature $j$ from the other features.

Judgment Criteria:

- $\text{VIF} < 5$: No serious collinearity
- $5 \le \text{VIF} < 10$: Moderate collinearity
- $\text{VIF} \ge 10$: Severe collinearity

Impact:

  1. Numerical Instability: $X^T X$ is nearly singular, so computing $(X^T X)^{-1}$ incurs large numerical error

  2. Large Parameter Variance: Standard error of weight estimates increases, confidence intervals widen

  3. Uninterpretable Parameters: Weight signs may contradict expectations

Example:

Suppose $x_2 \approx x_1$ (nearly collinear). Then:

  • True model: $y = x_1 + x_2$
  • The estimated model might return something like $y = 100 x_1 - 98 x_2$: huge, unstable coefficients whose contributions nearly cancel
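This instability is easy to reproduce. The sketch below (synthetic data; the $10^{-6}$ perturbation and $\lambda = 1$ are arbitrary illustrative choices of mine) fits OLS and ridge on two nearly identical features. On this draw, OLS returns huge offsetting coefficients whose sum still matches the true total effect of about 2, while ridge keeps both coefficients near 1:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200
x1 = rng.standard_normal(m)
x2 = x1 + 1e-6 * rng.standard_normal(m)   # x2 ≈ x1: near-perfect collinearity
y = x1 + x2 + 0.1 * rng.standard_normal(m)

X = np.c_[x1, x2]
# OLS: individual coefficients are wildly unstable along the (1, -1) direction
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: the lambda*I term pins down the undetermined direction
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The well-determined direction (the sum of the coefficients) is estimated accurately either way; collinearity only destroys the split between the two individual weights, which is exactly what ridge repairs.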

Solutions:

  1. Remove Redundant Features: Identify and remove via VIF
  2. PCA Dimensionality Reduction: Transform correlated features into orthogonal principal components
  3. Ridge Regression: the $\lambda\|w\|^2$ penalty mitigates collinearity
  4. Collect More Data: Increase sample size to reduce estimation variance

Python VIF Detection:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# Compute VIF for each feature
vif_data = []
for i in range(X.shape[1]):
    vif = variance_inflation_factor(X, i)
    vif_data.append({'Feature': f'X{i+1}', 'VIF': vif})

df_vif = pd.DataFrame(vif_data)
print(df_vif)

# Flag features with VIF > 10
high_vif = df_vif[df_vif['VIF'] > 10]['Feature'].tolist()
print(f"High VIF features (consider removal): {high_vif}")

Q7: What are the assumptions of linear regression? What if violated?

Four Key Assumptions of Classical Linear Regression:

Assumption 1: Linearity — $E[y \mid x] = w^T x$.

Check: The residual plot should show no obvious pattern.

When Violated:

  • Add polynomial features: $x^2, x^3, \ldots$
  • Feature transformations: $\log x$, $\sqrt{x}$, $1/x$, etc.
  • Use nonlinear models: decision trees, neural networks

Assumption 2: Independence — the errors $\epsilon_i$ are mutually independent.

Check: Durbin-Watson test; a statistic $\text{DW} \approx 2$ indicates no autocorrelation.

When Violated (common in time series):

  • Use autoregressive models (AR, ARIMA)
  • Add lagged terms as features

Assumption 3: Homoscedasticity — $\text{Var}(\epsilon_i) = \sigma^2$ (constant).

Check: In the residual plot, the residual spread should remain constant across fitted values.

When Violated (heteroscedasticity):

  • Weighted Least Squares (WLS): different weights for different samples
  • Robust Standard Errors
  • Log-transform target variable

Assumption 4: Normality — $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Check:

  • QQ plot (Quantile-Quantile Plot)
  • Shapiro-Wilk test

When Violated:

  • Transform target variable (Box-Cox transformation)
  • Use non-parametric methods or generalized linear models

Diagnostic Code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Fit model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
residuals = y_train - y_pred

# Residual plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Residuals vs Fitted
axes[0, 0].scatter(y_pred, residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# 2. QQ Plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')

# 3. Scale-Location Plot (check homoscedasticity)
standardized_residuals = residuals / np.std(residuals)
axes[1, 0].scatter(y_pred, np.sqrt(np.abs(standardized_residuals)), alpha=0.5)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot')

# 4. Residual Histogram
axes[1, 1].hist(residuals, bins=30, edgecolor='black')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Histogram of Residuals')

plt.tight_layout()
plt.show()

Q8: How to handle categorical features?

Problem: Linear regression requires numerical inputs, but many features are categorical (e.g., "color": red, green, blue).

Wrong Approach: Directly encode as integers (red=1, green=2, blue=3).

Issue: Introduces false ordinal relationships (model thinks blue is "2 times larger" than red).

Correct Method:

One-Hot Encoding:

Convert a $k$-category feature into $k$ binary features.

Example:

| Original | Red | Green | Blue |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
| Blue | 0 | 0 | 1 |

Code:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Method 1: pandas
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
encoded = pd.get_dummies(data['color'], prefix='color')
print(encoded)

# Method 2: sklearn (older sklearn versions use sparse=False instead of sparse_output)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
X_encoded = encoder.fit_transform(data[['color']])

Notes:

  • Multicollinearity: the $k$ one-hot columns are perfectly linearly dependent with the intercept (they sum to 1). Solution: drop one column (drop='first') or use Ridge regression.
  • High Cardinality Features: If the category count $k$ is large (e.g., ZIP codes), consider:
    • Target Encoding
    • Embeddings

Q9: Can linear regression handle nonlinear relationships?

Answer: Yes, through feature engineering.

Key Insight: "Linear" in linear regression refers to linearity in the parameters, not in the features. A model such as $y = w_1 x + w_2 x^2 + w_3 \log x + b$ is still linear in $w$, and is therefore solvable by linear regression on transformed features.

Method 1: Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Generate polynomial features (degree 3)
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Example: [x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3]

Method 2: Interaction Features

from sklearn.preprocessing import PolynomialFeatures

# Generate only interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)

# Example: [x1, x2, x3] -> [x1, x2, x3, x1*x2, x1*x3, x2*x3]

Method 3: Custom Transformations

import numpy as np

# Add log, exp, etc. transformations
X_transformed = np.c_[
    X,
    np.log(X + 1),   # Log
    np.sqrt(X),      # Square root
    X ** 2,          # Square
    np.sin(X)        # Trigonometric
]

Cautions:

  • Overfitting Risk: The feature count grows rapidly (degree-$p$ polynomial features in $d$ variables give $\binom{d+p}{p}$ terms)
  • Regularization Necessity: Use Ridge or Lasso to control complexity
  • Interpretability Decline: High-degree terms are hard to interpret

Example: Fitting Sine Curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Generate data
X = np.linspace(0, 4*np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(100) * 0.1

# Polynomial fitting with Ridge regularization
model = Pipeline([
    ('poly', PolynomialFeatures(degree=9)),
    ('ridge', Ridge(alpha=0.1))
])

model.fit(X, y)
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X, y_pred, 'r-', linewidth=2, label='Polynomial Fit')
plt.plot(X, np.sin(X), 'g--', label='True Function')
plt.legend()
plt.show()

Q10: How to interpret linear regression coefficients?

Basic Interpretation:

For the model $y = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$:

Meaning of $w_j$: With the other features held constant, increasing $x_j$ by 1 unit changes $y$ by $w_j$ units on average.

Cautions:

  1. Scale Dependence: Unless the features are standardized, weight magnitudes cannot be compared directly as measures of importance.

  2. Collinearity Impact: Multicollinearity causes unstable weights.

  3. Causality: Correlation ≠ causation. Weights only show statistical association, cannot infer causality.

Standardized Coefficients:

Fit the model on standardized data; the resulting coefficients can then be compared for relative importance.

Interpretation: increasing $x_j$ by one standard deviation changes the prediction by $w_j^{\text{std}}$ units.

Significance Testing:

Use a t-test to determine whether $w_j$ is significantly different from 0: $t_j = \hat{w}_j / \text{SE}(\hat{w}_j)$, where $\text{SE}(\hat{w}_j)$ is the standard error. If $|t_j|$ is large (p-value small), reject the null hypothesis $w_j = 0$.

Python Implementation:

import statsmodels.api as sm

# Add constant term
X_with_const = sm.add_constant(X)

# Fit model
model = sm.OLS(y, X_with_const).fit()

# Print summary
print(model.summary())

# Includes:
# - Coefficient estimates
# - Standard errors
# - t-statistics
# - p-values
# - Confidence intervals

Confidence Interval:

The 95% confidence interval (large-sample approximation): $\hat{w}_j \pm 1.96 \cdot \text{SE}(\hat{w}_j)$. If the interval doesn't include 0, the parameter is significant.

Summary and Outlook

Core Points:

  1. Three Perspectives: the algebraic (normal equation), geometric (projection), and probabilistic (MLE) derivations converge to the same solution

  2. Least Squares Solution: $w^* = (X^T X)^{-1} X^T y$

  3. Regularization: Ridge (L2) ensures invertibility; Lasso (L1) achieves sparsity

  4. Optimization Algorithms: use the normal equation for small data, gradient descent for big data

  5. Model Diagnostics: check linearity, independence, homoscedasticity, normality

Practical Points:

  • Feature standardization
  • Handle collinearity
  • Select regularization parameter
  • Cross-validation evaluation

Next Chapter Preview: Chapter 6 will explore logistic regression and classification problems, extending linear models to discrete output spaces. We will derive the origin of the sigmoid function, mathematical foundation of cross-entropy loss, and geometric meaning of decision boundaries.

References

  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.

  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

  3. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

  4. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

  5. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  6. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.

  7. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.

  8. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

  9. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.

  10. Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73-101.

  11. Li, H. (2012). Statistical Learning Methods. Tsinghua University Press.

  12. Zhou, Z. H. (2016). Machine Learning. Tsinghua University Press.

  • Post title:Mathematical Derivation of Machine Learning (5): Linear Regression
  • Post author:Chen Kai
  • Create time:2025-10-25 00:00:00
  • Post link:https://www.chenk.top/ml-math-05-linear-regression/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.