Machine Learning Mathematical Derivations (5): Linear Regression
Chen Kai

In 1886, Francis Galton discovered a peculiar phenomenon while studying the relationship between parents' and children's heights: parents with extreme heights tended to have children whose heights were closer to the average. He coined the term "regression toward the mean," which is where "regression" comes from. However, the true power of linear regression lies not in statistical description, but in its role as the mathematical foundation for almost all machine learning algorithms — from neural networks to support vector machines, all can be viewed as generalizations of linear regression.

The essence of linear regression is finding the optimal hyperplane in data space. This seemingly simple problem conceals deep connections between linear algebra, probability theory, and optimization. This chapter provides complete mathematical derivations of linear regression from multiple perspectives.

Basic Form of Linear Regression

Problem Definition

Given training dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$, where:

  • $\mathbf{x}_i \in \mathbb{R}^d$ is a $d$-dimensional input vector (features)
  • $y_i \in \mathbb{R}$ is the scalar output (target value)

Objective: Find parameter vector $\mathbf{w} \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ such that the linear model:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

best fits the training data.

Notation Simplification: To unify notation, we absorb the bias into the weight vector. Define the augmented feature and weight vectors:

$$\tilde{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}, \qquad \tilde{\mathbf{w}} = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$$

Then the model simplifies to:

$$\hat{y} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$$

For brevity, we omit the tildes and write $\hat{y} = \mathbf{w}^\top \mathbf{x}$, understanding that $\mathbf{x}$ includes the constant term 1.

Matrix Form

Organize all training samples into matrix form:

Design Matrix:

$$X = \begin{bmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_m^\top \end{bmatrix} \in \mathbb{R}^{m \times d}$$

Output Vector:

$$\mathbf{y} = (y_1, y_2, \ldots, y_m)^\top \in \mathbb{R}^m$$

Prediction Vector:

$$\hat{\mathbf{y}} = X\mathbf{w}$$

Our goal is to find the optimal $\mathbf{w}$ such that the prediction vector $\hat{\mathbf{y}}$ is as close as possible to the true values $\mathbf{y}$.

Least Squares: Algebraic Derivation

Loss Function

Using squared loss (L2 loss) to measure prediction error:

$$L(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 = \frac{1}{2}\sum_{i=1}^{m}(\mathbf{w}^\top \mathbf{x}_i - y_i)^2$$

The coefficient $\frac{1}{2}$ eliminates constants when taking derivatives.

Objective:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} L(\mathbf{w})$$

Gradient Derivation

Calculate the gradient of the loss function with respect to $\mathbf{w}$. Expanding:

$$L(\mathbf{w}) = \frac{1}{2}\left(\mathbf{w}^\top X^\top X \mathbf{w} - 2\mathbf{w}^\top X^\top \mathbf{y} + \mathbf{y}^\top \mathbf{y}\right)$$

Taking derivatives with respect to $\mathbf{w}$ (using the matrix differentiation formulas $\nabla_{\mathbf{w}}(\mathbf{w}^\top A \mathbf{w}) = 2A\mathbf{w}$ for symmetric $A$ and $\nabla_{\mathbf{w}}(\mathbf{w}^\top \mathbf{b}) = \mathbf{b}$):

Therefore:

$$\nabla_{\mathbf{w}} L = X^\top X \mathbf{w} - X^\top \mathbf{y} = X^\top (X\mathbf{w} - \mathbf{y})$$

Normal Equation

Setting the gradient to zero:

$$X^\top X \mathbf{w} = X^\top \mathbf{y}$$

This is the famous Normal Equation.

Theorem 1 (Least Squares Solution): If $X^\top X$ is invertible, the unique solution to the least squares problem is:

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$

Proof:

  1. First-order necessary condition: $\nabla_{\mathbf{w}} L = 0$ gives $X^\top X \mathbf{w} = X^\top \mathbf{y}$

  2. Second-order sufficient condition: Computing the Hessian matrix:

$$H = \nabla^2_{\mathbf{w}} L = X^\top X$$

For any non-zero vector $\mathbf{v}$:

$$\mathbf{v}^\top X^\top X \mathbf{v} = \|X\mathbf{v}\|^2 \geq 0$$

If $X$ has full column rank (i.e., $\operatorname{rank}(X) = d$), then $\|X\mathbf{v}\|^2 = 0$ if and only if $\mathbf{v} = 0$, thus $H$ is positive definite.

  3. Positive definite Hessian + zero gradient = global minimum. QED.

Invertibility Conditions for $X^\top X$:

  • Necessary and sufficient condition: $X$ has full column rank, i.e., $\operatorname{rank}(X) = d$
  • Equivalent condition: $X^\top X$ is positive definite
  • Practical interpretation:
    • Number of samples $m \geq d$ (at least as many samples as features)
    • Features are linearly independent (no perfect collinearity)
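The normal equation is easy to check numerically. Below is a minimal sketch with NumPy; the synthetic data and the choice of `np.linalg.solve` versus `np.linalg.lstsq` are illustrative, not from the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3
# Last column of ones plays the role of the absorbed bias term.
X = np.c_[rng.standard_normal((m, d - 1)), np.ones(m)]
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(m)

# Solve the normal equation X^T X w = X^T y.
# np.linalg.solve is preferred over forming the explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem via SVD.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes agree, and with small noise the estimate lands close to the generating weights.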

Moore-Penrose Pseudoinverse

When $X^\top X$ is not invertible (e.g., $m < d$ or collinear features), use the Pseudoinverse:

$$\mathbf{w}^* = X^+ \mathbf{y}$$

where $X^+$ is the Moore-Penrose pseudoinverse.

Properties:

  • When $X^\top X$ is invertible, $X^+ = (X^\top X)^{-1} X^\top$ (degenerates to the ordinary formula)
  • $X^+ \mathbf{y}$ is the minimum norm solution among all solutions satisfying $X^\top X \mathbf{w} = X^\top \mathbf{y}$

Computation: Via Singular Value Decomposition (SVD). Let $X = U \Sigma V^\top$, then:

$$X^+ = V \Sigma^+ U^\top$$

where $\Sigma^+$ is the pseudoinverse of $\Sigma$ (non-zero singular values are inverted, zeros remain zero).
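The SVD construction can be verified against NumPy's built-in pseudoinverse; the underdetermined random system below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(1)
# Underdetermined system: fewer samples than features (m < d),
# so X^T X is singular and the normal equation has infinitely many solutions.
m, d = 5, 8
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)

# Moore-Penrose pseudoinverse built from the SVD, X^+ = V Sigma^+ U^T
# (this is essentially what np.linalg.pinv computes).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Minimum-norm least-squares solution.
w_min_norm = X_pinv @ y
```

Since $X$ here has full row rank, the minimum-norm solution even interpolates the data exactly.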

Geometric Interpretation: Projection Perspective

Column Space and Projection

The geometric essence of linear regression is orthogonal projection.

Definitions:

  • $\mathcal{C}(X)$ is the column space of $X$, i.e., the subspace spanned by the columns of $X$
  • $\mathbf{y}$ is a vector in $\mathbb{R}^m$
  • The goal is to find the point in $\mathcal{C}(X)$ closest to $\mathbf{y}$

Theorem 2 (Orthogonal Projection Theorem): $\hat{\mathbf{y}} = X\mathbf{w}^*$ is the orthogonal projection of $\mathbf{y}$ onto $\mathcal{C}(X)$ if and only if the residual vector $\mathbf{y} - X\mathbf{w}^*$ is orthogonal to $\mathcal{C}(X)$:

$$X^\top (\mathbf{y} - X\mathbf{w}^*) = 0$$

This is exactly the normal equation!

Proof:

For any $\mathbf{w}$, consider the squared norm of the prediction error:

$$\|\mathbf{y} - X\mathbf{w}\|^2 = \|(\mathbf{y} - X\mathbf{w}^*) + X(\mathbf{w}^* - \mathbf{w})\|^2$$

Let $\mathbf{u} = \mathbf{y} - X\mathbf{w}^*$ and $\mathbf{v} = X(\mathbf{w}^* - \mathbf{w})$. If $\mathbf{u} \perp \mathbf{v}$, then by the Pythagorean theorem (which holds when two vectors are orthogonal):

$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$$

Therefore:

$$\|\mathbf{y} - X\mathbf{w}\|^2 = \|\mathbf{y} - X\mathbf{w}^*\|^2 + \|X(\mathbf{w}^* - \mathbf{w})\|^2 \geq \|\mathbf{y} - X\mathbf{w}^*\|^2$$

Equality holds if and only if $X\mathbf{w} = X\mathbf{w}^*$. QED.

Projection Matrix

Definition: The projection matrix projects any vector onto $\mathcal{C}(X)$:

$$P = X(X^\top X)^{-1} X^\top$$

Properties:

  1. Idempotent: $P^2 = P$ (projecting twice equals projecting once)

  2. Symmetric: $P^\top = P$

  3. Effect: $P\mathbf{y} = \hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the column space

Residual Projection Matrix:

$$M = I - P$$

Satisfying:

$$M\mathbf{y} = \mathbf{y} - \hat{\mathbf{y}}$$

Properties:

  • $M^2 = M$ (idempotent)
  • $M^\top = M$ (symmetric)
  • $MX = 0$ (residual is orthogonal to the column space)
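These projector identities can be checked directly on a small random design matrix; the sketch below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 20, 4
X = rng.standard_normal((m, d))   # full column rank almost surely
y = rng.standard_normal(m)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(m) - P                      # residual projector

y_hat = P @ y          # fitted values
residual = M @ y       # residual, orthogonal to every column of X
```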

Geometric Intuition

In $m$-dimensional space:

  • $\mathbf{y}$ is the true output vector
  • $\mathcal{C}(X)$ is a $d$-dimensional subspace (if $X$ has full column rank)
  • $\hat{\mathbf{y}} = P\mathbf{y}$ is the "shadow" of $\mathbf{y}$ on $\mathcal{C}(X)$
  • $\mathbf{y} - \hat{\mathbf{y}}$ is the residual, perpendicular to $\mathcal{C}(X)$

Analogy (2D plane projection): Imagine, in 3D space, the perpendicular distance from a point $\mathbf{y}$ to a 2D plane. The projection point $\hat{\mathbf{y}}$ minimizes this distance.

Least Squares Geometric Interpretation

The figure below shows OLS as orthogonal projection in 3D: the observation vector $\mathbf{y}$ is projected onto the column space of the design matrix $X$, with the residual perpendicular to the column space:

Projection Geometry

Probabilistic Perspective: Maximum Likelihood Estimation

Linear Gaussian Model

Assume the data generating process is:

$$y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed Gaussian noise.

Equivalent form:

$$p(y_i \mid \mathbf{x}_i) = \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2)$$

That is, given $\mathbf{x}_i$, $y_i$ follows a Gaussian distribution with mean $\mathbf{w}^\top \mathbf{x}_i$ and variance $\sigma^2$.

Likelihood Function

Given training data, the likelihood function for parameters $(\mathbf{w}, \sigma^2)$ is:

$$L(\mathbf{w}, \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}\right)$$

Log-Likelihood

Taking the logarithm (a monotonic transformation does not change the maximum point):

$$\ell(\mathbf{w}, \sigma^2) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$

Maximum Likelihood Estimation

Optimization with respect to $\mathbf{w}$:

Maximizing $\ell$ is equivalent to minimizing:

$$\sum_{i=1}^{m}(y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$

This is exactly the least squares objective!

Theorem 3: Under the linear Gaussian model, maximum likelihood estimation is equivalent to least squares estimation:

$$\hat{\mathbf{w}}_{\text{MLE}} = \hat{\mathbf{w}}_{\text{LS}} = (X^\top X)^{-1} X^\top \mathbf{y}$$

Optimization with respect to $\sigma^2$:

Fixing $\hat{\mathbf{w}}$, taking the partial derivative with respect to $\sigma^2$ and setting it to zero, then solving:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{\mathbf{w}}^\top \mathbf{x}_i)^2$$

The MLE of the noise variance is the mean of the squared residuals.

Bayesian Perspective

Introducing a prior distribution $p(\mathbf{w})$ on the parameters, compute the posterior via Bayes' rule:

$$p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})$$

Gaussian Prior: Assuming $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$, maximizing the posterior probability (MAP) is equivalent to minimizing:

$$\frac{1}{2\sigma^2}\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|^2$$

This is the Ridge Regression objective function, with $\lambda = \sigma^2/\tau^2$! The regularization term reflects the strength of the prior belief.
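As a sanity check on the MAP-Ridge correspondence, the sketch below solves the MAP problem via the equivalent ridge normal equation; the variances $\sigma^2$ and $\tau^2$ are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 50, 3
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)

sigma2, tau2 = 0.25, 2.0       # hypothetical noise and prior variances
lam = sigma2 / tau2            # equivalent ridge parameter

# MAP objective: (1/(2*sigma2))||Xw - y||^2 + (1/(2*tau2))||w||^2.
# Its minimizer satisfies the ridge normal equation (X^T X + lam I) w = X^T y.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of the MAP objective at the solution should vanish.
map_gradient = X.T @ (X @ w_map - y) / sigma2 + w_map / tau2
```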

Regularization: Ridge Regression and Lasso

Ridge Regression (L2 Regularization)

Objective Function:

$$L_{\text{ridge}}(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

where $\lambda \geq 0$ is the regularization parameter.

Gradient:

$$\nabla_{\mathbf{w}} L_{\text{ridge}} = X^\top (X\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}$$

Setting the gradient to zero:

$$(X^\top X + \lambda I)\mathbf{w} = X^\top \mathbf{y}$$

Analytical Solution:

$$\mathbf{w}^*_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$

Key Observations:

  • Adding $\lambda I$ guarantees invertibility of $X^\top X + \lambda I$ (even if $X^\top X$ is not invertible)
  • When $\lambda \to 0$, the solution degenerates to ordinary least squares
  • When $\lambda \to \infty$, $\mathbf{w}^* \to 0$ (extreme regularization)

Matrix Perspective: Ridge regression "stabilizes" the matrix $X^\top X$ by adding diagonal terms, avoiding ill-conditioning problems.
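The stabilizing effect is easy to demonstrate with two perfectly collinear columns; this example is a sketch, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 30
x1 = rng.standard_normal(m)
X = np.c_[x1, 2.0 * x1]        # perfectly collinear columns: X^T X is singular
y = 3.0 * x1 + 0.1 * rng.standard_normal(m)

XtX = X.T @ X
rank_XtX = np.linalg.matrix_rank(XtX)   # rank-deficient: no unique OLS solution

# With any lam > 0, X^T X + lam*I is positive definite and the ridge
# solution is uniquely determined.
lam = 1.0
eigenvalues = np.linalg.eigvalsh(XtX + lam * np.eye(2))
w_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)
```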

Lasso Regression (L1 Regularization)

Objective Function:

$$L_{\text{lasso}}(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|_1$$

where $\|\mathbf{w}\|_1 = \sum_j |w_j|$ is the L1 norm.

Characteristics:

  • No analytical solution (the L1 norm is not differentiable at zero)
  • Requires iterative algorithms (e.g., coordinate descent, proximal gradient)
  • Sparsity: Some parameters are compressed exactly to zero, achieving feature selection

Geometric Interpretation:

In constrained form:

$$\min_{\mathbf{w}} \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 \quad \text{s.t.} \quad \|\mathbf{w}\|_1 \leq t$$

The L1 constraint ball is a diamond (a hyperdiamond in high dimensions), whose corners are more likely to intersect the loss contours on the coordinate axes, leading to sparse solutions.
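One of the iterative algorithms mentioned above, proximal gradient descent (ISTA), reduces to repeated soft-thresholding. A minimal sketch, where the data, the `lasso_ista` helper, and the choice $\lambda = 5$ are all illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrinks z toward 0, zeroing |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for (1/2)||Xw - y||^2 + lam*||w||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - X.T @ (X @ w - y) / L, lam / L)
    return w

rng = np.random.default_rng(5)
m, d = 100, 10
X = rng.standard_normal((m, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]                   # only 2 relevant features
y = X @ w_true + 0.1 * rng.standard_normal(m)

w_lasso = lasso_ista(X, y, lam=5.0)
```

The relevant coefficients survive (slightly shrunk) while most irrelevant ones are set exactly to zero, illustrating the feature-selection behavior.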

Elastic Net

Combining L1 and L2:

$$L_{\text{EN}}(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 + \lambda_1 \|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|^2$$

Advantages:

  • Retains L1 sparsity
  • Retains L2 stability (friendly to collinear features)

Effects of Regularization

Bias-Variance Tradeoff:

  • No regularization ($\lambda = 0$): Low bias, high variance (overfitting)
  • Strong regularization (large $\lambda$): High bias, low variance (underfitting)
  • Optimal $\lambda$: Selected via cross-validation

Ridge Trace:

Plot $w_j(\lambda)$ as a function of $\lambda$ to observe the shrinkage behavior of the parameters.

Regularization Path Comparison

Gradient Descent Algorithms

Batch Gradient Descent (BGD)

When the data is large, directly computing $(X^\top X)^{-1}$ is too expensive ($O(d^3)$). Use iterative algorithms instead.

Algorithm:

  1. Initialize $\mathbf{w}_0$
  2. Repeat until convergence:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} L(\mathbf{w})$$

where $\alpha$ is the learning rate, and the gradient is:

$$\nabla_{\mathbf{w}} L = \frac{1}{m} X^\top (X\mathbf{w} - \mathbf{y})$$

Complete Form:

$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{m} X^\top (X\mathbf{w} - \mathbf{y})$$

Computational Complexity: $O(md)$ per iteration (matrix-vector multiplication).

Stochastic Gradient Descent (SGD)

Update parameters using only one sample at a time.

Algorithm:

  1. Initialize $\mathbf{w}_0$
  2. For $t = 1, 2, \ldots$:
    • Randomly select a sample $(\mathbf{x}_i, y_i)$
    • Update: $\mathbf{w} \leftarrow \mathbf{w} - \alpha (\mathbf{w}^\top \mathbf{x}_i - y_i)\mathbf{x}_i$

Advantages:

  • Fast per iteration ($O(d)$)
  • Suitable for large-scale data
  • Ability to escape local minima (for non-convex problems)

Disadvantages:

  • Unstable convergence (high variance)
  • Requires careful learning rate tuning

Mini-batch Gradient Descent

Compromise: Use $B$ samples (the batch size) each time:

$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{|B_t|} \sum_{i \in B_t} (\mathbf{w}^\top \mathbf{x}_i - y_i)\mathbf{x}_i$$

where $B_t$ is the batch at iteration $t$ (of size $B$).

Typical choices: $B = 32, 64, 128, 256$.

Convergence Analysis

Theorem 4 (BGD Convergence): For learning rate $\alpha \leq 1/\lambda_{\max}$, batch gradient descent converges linearly to the optimal solution:

$$\|\mathbf{w}_t - \mathbf{w}^*\| \leq (1 - \alpha \lambda_{\min})^t \|\mathbf{w}_0 - \mathbf{w}^*\|$$

where $\lambda_{\min}$, $\lambda_{\max}$ are the minimum and maximum eigenvalues of $X^\top X$.

Proof Sketch:

  1. Loss function is strongly convex (Hessian is positive definite)
  2. Gradient is Lipschitz continuous
  3. Apply convergence theorem for strongly convex functions

Practical Recommendations:

  • Learning rate: $\alpha = 1/\lambda_{\max}$ (conservative choice)
  • Adaptive learning rates: Adam, RMSprop, etc.
  • Learning rate decay: e.g. $\alpha_t = \frac{\alpha_0}{1 + kt}$

Gradient Descent Animation

Model Evaluation and Selection

Evaluation Metrics

Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE):

$$\text{RMSE} = \sqrt{\text{MSE}}$$

Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$$

Coefficient of Determination (R²):

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $\bar{y}$ is the mean of $y$.

Interpretation:

  • $R^2 = 1$: Perfect fit
  • $R^2 = 0$: Model equivalent to predicting the mean
  • $R^2 < 0$: Model worse than predicting the mean (rare, indicates model problems)

Adjusted R²

Penalizing model complexity:

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(m - 1)}{m - d - 1}$$

Advantage: Adding useless features decreases $\bar{R}^2$ (while $R^2$ always increases).
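The metrics above fit in a few lines of NumPy; the `regression_metrics` helper and the toy numbers below are invented for illustration:

```python
import numpy as np

def regression_metrics(y_true, y_pred, d=None):
    """MSE, RMSE, MAE, R^2 and (optionally) adjusted R^2 for d features."""
    m = len(y_true)
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    out = {"mse": mse, "rmse": np.sqrt(mse), "mae": mae, "r2": r2}
    if d is not None:
        out["adj_r2"] = 1.0 - (1.0 - r2) * (m - 1) / (m - d - 1)
    return out

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
metrics = regression_metrics(y_true, y_pred, d=1)
```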

The figure below shows four classic residual diagnostic plots — residuals vs fitted values, Q-Q plot, Scale-Location plot, and residual histogram — for checking model assumptions:

Residual Diagnostics

Cross-Validation

k-fold Cross-Validation:

  1. Split data intofolds
  2. For:
    • Train on all data except fold
    • Test on fold
  3. Average thetest errors

Python Implementation:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"RMSE: {rmse_scores.mean():.4f} ± {rmse_scores.std():.4f}")

Complete Code Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LinearRegression:
    """
    Complete implementation of linear regression

    Supports:
    - Analytical solution (normal equation)
    - Batch gradient descent
    - Stochastic gradient descent
    - Ridge regularization
    """

    def __init__(self, method='normal', alpha=0.01, n_iterations=1000,
                 lambda_reg=0.0, batch_size=None, random_state=42):
        """
        Parameters:
            method: str, solving method
                'normal': Normal equation
                'bgd': Batch gradient descent
                'sgd': Stochastic gradient descent
                'mini_batch': Mini-batch gradient descent
            alpha: float, learning rate (only for gradient descent)
            n_iterations: int, number of iterations
            lambda_reg: float, Ridge regularization parameter
            batch_size: int, batch size (only for mini_batch)
            random_state: int, random seed
        """
        self.method = method
        self.alpha = alpha
        self.n_iterations = n_iterations
        self.lambda_reg = lambda_reg
        self.batch_size = batch_size
        self.random_state = random_state
        self.w = None
        self.loss_history = []

    def fit(self, X, y):
        """
        Train linear regression model

        Parameters:
            X: np.array, shape=(m, d), input features
            y: np.array, shape=(m,), output labels
        """
        # Add bias term
        X_bias = self._add_bias(X)
        m, d = X_bias.shape

        # Initialize weights
        np.random.seed(self.random_state)
        self.w = np.random.randn(d) * 0.01

        if self.method == 'normal':
            # Normal equation
            self.w = self._normal_equation(X_bias, y)
        elif self.method == 'bgd':
            # Batch gradient descent
            self._batch_gradient_descent(X_bias, y)
        elif self.method == 'sgd':
            # Stochastic gradient descent
            self._stochastic_gradient_descent(X_bias, y)
        elif self.method == 'mini_batch':
            # Mini-batch gradient descent
            if self.batch_size is None:
                self.batch_size = min(32, m)
            self._mini_batch_gradient_descent(X_bias, y)
        else:
            raise ValueError(f"Unknown method: {self.method}")

        return self

    def predict(self, X):
        """
        Prediction

        Parameters:
            X: np.array, shape=(m, d)

        Returns:
            y_pred: np.array, shape=(m,)
        """
        X_bias = self._add_bias(X)
        return X_bias @ self.w

    def score(self, X, y):
        """
        Calculate R² score
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    def _add_bias(self, X):
        """Add bias column"""
        return np.c_[X, np.ones(X.shape[0])]

    def _compute_loss(self, X, y):
        """Compute loss (including regularization)"""
        predictions = X @ self.w
        mse = np.mean((predictions - y) ** 2)
        reg = self.lambda_reg * np.sum(self.w[:-1] ** 2)  # Don't regularize bias
        return mse + reg

    def _compute_gradient(self, X, y):
        """Compute gradient"""
        m = len(y)
        predictions = X @ self.w
        gradient = X.T @ (predictions - y) / m
        # Add regularization gradient (excluding bias)
        if self.lambda_reg > 0:
            reg_gradient = np.zeros_like(self.w)
            reg_gradient[:-1] = 2 * self.lambda_reg * self.w[:-1]
            gradient += reg_gradient
        return gradient

    def _normal_equation(self, X, y):
        """Solve via normal equation"""
        # w = (X^T X + lambda*I)^{-1} X^T y
        d = X.shape[1]
        reg_matrix = self.lambda_reg * np.eye(d)
        reg_matrix[-1, -1] = 0  # Don't regularize bias

        try:
            w = np.linalg.solve(X.T @ X + reg_matrix, X.T @ y)
        except np.linalg.LinAlgError:
            # If matrix is singular, use pseudoinverse
            w = np.linalg.pinv(X.T @ X + reg_matrix) @ X.T @ y

        return w

    def _batch_gradient_descent(self, X, y):
        """Batch gradient descent"""
        for i in range(self.n_iterations):
            gradient = self._compute_gradient(X, y)
            self.w -= self.alpha * gradient

            # Record loss
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _stochastic_gradient_descent(self, X, y):
        """Stochastic gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)

        for i in range(self.n_iterations):
            # Randomly select one sample
            idx = np.random.randint(m)
            X_i = X[idx:idx+1]
            y_i = y[idx:idx+1]

            gradient = self._compute_gradient(X_i, y_i)
            self.w -= self.alpha * gradient

            # Record loss (every 10 iterations)
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _mini_batch_gradient_descent(self, X, y):
        """Mini-batch gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)

        for i in range(self.n_iterations):
            # Randomly select batch
            indices = np.random.choice(m, self.batch_size, replace=False)
            X_batch = X[indices]
            y_batch = y[indices]

            gradient = self._compute_gradient(X_batch, y_batch)
            self.w -= self.alpha * gradient

            # Record loss
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))


# Example: Housing Price Prediction
def demo_linear_regression():
    """
    Complete example: Linear regression application on simulated housing data
    """
    # Generate synthetic data (simulating house price prediction)
    np.random.seed(42)
    m = 500
    d = 5

    # True weights (last entry is the bias)
    w_true = np.array([50, -20, 30, 15, -10, 200])

    # Generate features (standardized)
    X = np.random.randn(m, d)

    # Add bias and compute true values
    X_bias = np.c_[X, np.ones(m)]
    y_true = X_bias @ w_true

    # Add noise
    noise = np.random.randn(m) * 20
    y = y_true + noise

    # Split training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Compare different methods
    methods = {
        'Normal Equation': LinearRegression(method='normal'),
        'Batch GD': LinearRegression(method='bgd', alpha=0.1, n_iterations=1000),
        'SGD': LinearRegression(method='sgd', alpha=0.01, n_iterations=2000),
        'Mini-batch GD': LinearRegression(method='mini_batch', alpha=0.05,
                                          n_iterations=1000, batch_size=32)
    }

    results = {}

    print("=" * 70)
    print("Linear Regression Method Comparison")
    print("=" * 70)

    for name, model in methods.items():
        # Train
        model.fit(X_train_scaled, y_train)

        # Evaluate
        train_score = model.score(X_train_scaled, y_train)
        test_score = model.score(X_test_scaled, y_test)
        y_pred = model.predict(X_test_scaled)
        rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

        results[name] = {
            'train_r2': train_score,
            'test_r2': test_score,
            'rmse': rmse,
            'weights': model.w
        }

        print(f"\n{name}:")
        print(f"  Training R²: {train_score:.4f}")
        print(f"  Test R²: {test_score:.4f}")
        print(f"  Test RMSE: {rmse:.2f}")

    # Ridge Regression: Effect of regularization
    print("\n" + "=" * 70)
    print("Ridge Regression: Regularization Parameter Selection")
    print("=" * 70)

    lambdas = [0, 0.01, 0.1, 1, 10, 100]
    train_scores = []
    test_scores = []

    for lam in lambdas:
        model = LinearRegression(method='normal', lambda_reg=lam)
        model.fit(X_train_scaled, y_train)
        train_scores.append(model.score(X_train_scaled, y_train))
        test_scores.append(model.score(X_test_scaled, y_test))
        print(f"λ={lam:6.2f}: Training R²={train_scores[-1]:.4f}, "
              f"Test R²={test_scores[-1]:.4f}")


if __name__ == "__main__":
    demo_linear_regression()

Q&A: Core Linear Regression Questions

Q1: Why use squared loss instead of absolute value loss?

Mathematical Reasons:

  1. Differentiability: $(\hat{y} - y)^2$ is differentiable everywhere, facilitating optimization
  2. Analytical solution: Squared loss leads to quadratic optimization problem with closed-form solution
  3. Statistical significance: Under Gaussian noise assumption, squared loss corresponds to maximum likelihood estimation

Absolute Value Loss (L1 loss):

  • Advantages: Robust to outliers (less affected by extreme values)
  • Disadvantages: Not differentiable at zero, no analytical solution, requires linear programming

Comparison:

| Loss Function | Differentiability | Analytical Solution | Outlier Robustness | Noise Distribution |
| --- | --- | --- | --- | --- |
| Squared (L2) | Everywhere | Yes | Weak | Gaussian |
| Absolute (L1) | Not at 0 | No | Strong | Laplacian |
| Huber | Everywhere | No | Medium | Mixed |

Huber Loss (compromise):

$$L_\delta(r) = \begin{cases} \frac{1}{2}r^2 & \text{if } |r| \leq \delta \\ \delta |r| - \frac{1}{2}\delta^2 & \text{if } |r| > \delta \end{cases}$$

Quadratic when $|r| \leq \delta$, linear when $|r| > \delta$, combining the advantages of both.
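A direct NumPy sketch of the piecewise definition; the `huber_loss` helper and the sample residuals are illustrative:

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r ** 2
    linear = delta * np.abs(r) - 0.5 * delta ** 2
    return np.where(np.abs(r) <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 5.0])
losses = huber_loss(residuals, delta=1.0)
```

For the outlier residual 5.0 the Huber loss is 4.5 rather than the squared loss's 12.5, which is exactly the robustness advantage described above.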


Q2: Normal equation vs gradient descent — when to use which?

Decision Tree:

Data Scale?
├─ Small scale (m < 10000, d < 1000)
│   └─ Use normal equation (fast, high precision)
└─ Large scale (m > 10000 or d > 1000)
    ├─ X^T X invertible?
    │   ├─ Yes → Gradient descent
    │   └─ No → Ridge regression or gradient descent
    └─ Enough memory?
        ├─ Yes → Batch gradient descent
        └─ No → SGD or Mini-batch GD

Detailed Comparison:

| Dimension | Normal Equation | Gradient Descent |
| --- | --- | --- |
| Time Complexity | $O(md^2 + d^3)$ | $O(kmd)$ ($k$ is iterations) |
| Space Complexity | $O(d^2)$ | $O(d)$ |
| Convergence | One step | Requires multiple iterations |
| Hyperparameters | None | Learning rate, iterations |
| Feature Count | Feasible up to roughly $d \sim 10^4$ | Arbitrary |
| Invertibility | $X^\top X$ must be invertible | Not required |
| Regularization | Easy to add | Easy to add |
| Online Learning | Not supported | Supported (SGD) |

Practical Recommendations:

  • Default choice: Use the normal equation for moderate $d$ (roughly $d < 10^4$), otherwise gradient descent
  • Big data: Must use SGD or Mini-batch GD
  • Real-time updates: Must use SGD (supports online learning)

Q3: Why does Ridge regression solution always exist?

Core Reason: Adding $\lambda I$ makes the matrix positive definite.

Theorem: For any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is positive definite, hence invertible.

Proof:

For any non-zero vector $\mathbf{v}$:

$$\mathbf{v}^\top (X^\top X + \lambda I)\mathbf{v} = \|X\mathbf{v}\|^2 + \lambda \|\mathbf{v}\|^2$$

Since $\|X\mathbf{v}\|^2 \geq 0$, we have $\mathbf{v}^\top (X^\top X + \lambda I)\mathbf{v} \geq \lambda \|\mathbf{v}\|^2 > 0$. Even if $X\mathbf{v} = 0$ ($\mathbf{v}$ is in the null space of $X$), the term $\lambda \|\mathbf{v}\|^2 > 0$ keeps the quadratic form strictly positive.

Therefore $X^\top X + \lambda I$ is positive definite, hence invertible.

Intuition:

  • $X^\top X$ may be singular (e.g., collinear features)
  • Adding $\lambda I$ is like adding a "safety cushion" to the diagonal
  • Even if some features are perfectly correlated, $\lambda I$ ensures positive definiteness

Geometric Interpretation:

In feature space, the null space of $X$ corresponds to directions that cannot be determined from the data. Ridge regression, by adding $\lambda \|\mathbf{w}\|^2$, imposes a "prefer zero" prior in these directions, making the problem well-posed.


Q4: How to choose regularization parameter?

Method 1: Cross-Validation (Recommended)

from sklearn.linear_model import RidgeCV
import numpy as np

# Automatically select optimal lambda
lambdas = np.logspace(-3, 3, 50)
model = RidgeCV(alphas=lambdas, cv=5)
model.fit(X_train, y_train)
print(f"Optimal λ: {model.alpha_:.4f}")

Method 2: Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
import numpy as np

param_grid = {'alpha': np.logspace(-3, 3, 20)}
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(f"Optimal λ: {grid.best_params_['alpha']:.4f}")

Method 3: Theoretical Guidance

Based on the bias-variance tradeoff, the optimal $\lambda$ scales as

$$\lambda^* \propto \frac{\sigma^2}{\|\mathbf{w}^*\|^2}$$

where $\sigma^2$ is the noise variance and $\mathbf{w}^*$ are the true parameters.

Practical Experience:

  • Starting point: $\lambda = 1$ (on standardized data)
  • Range: $\lambda \in [10^{-3}, 10^{3}]$ (logarithmic scale search)
  • Fine-tuning: Narrow range around optimal value
  • Early stopping: Stop if validation error keeps increasing

L-curve Method:

Plot $\log \|X\mathbf{w}_\lambda - \mathbf{y}\|$ vs $\log \|\mathbf{w}_\lambda\|$, and select $\lambda$ at the "elbow" of the curve.


Q5: Why is feature standardization important?

Problem: Different feature scales cause:

  1. Slow GD convergence: The loss surface becomes elongated (ill-conditioned), so the gradient does not point toward the optimum
  2. Unfair regularization: $\lambda \|\mathbf{w}\|^2$ penalizes features on different scales unequally

Example:

Suppose two features:

  • $x_1$: Area (range 0-1000 sqm)
  • $x_2$: Room count (range 1-10)

For $w_1 x_1$ and $w_2 x_2$ to have the same impact, $w_2$ must be roughly 100 times larger than $w_1$, so $\lambda \|\mathbf{w}\|^2$ mainly penalizes $w_2$.

Standardization Methods:

Z-score Standardization (Recommended):

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ is the feature mean and $\sigma$ its standard deviation.

Min-Max Standardization:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Effect Comparison:

| Method | Mean | Std/Range | Outlier Sensitivity |
| --- | --- | --- | --- |
| Raw data | Any | Any | - |
| Z-score | 0 | 1 | Medium |
| Min-Max | - | [0, 1] | High |

Code:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use training set mean and std

Note: After standardization, the weight interpretation changes. To recover original-scale weights:

$$w_j^{\text{orig}} = \frac{w_j^{\text{std}}}{\sigma_j}, \qquad b^{\text{orig}} = b^{\text{std}} - \sum_j \frac{w_j^{\text{std}} \mu_j}{\sigma_j}$$
Q6: How does multicollinearity affect linear regression?

Definition: Multicollinearity refers to high linear correlation among features.

Detection Method:

Variance Inflation Factor (VIF):

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination when predicting feature $x_j$ from the other features.

Guidelines:

  • $\text{VIF} < 5$: No serious collinearity
  • $5 \leq \text{VIF} \leq 10$: Moderate collinearity
  • $\text{VIF} > 10$: Severe collinearity
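VIF can be computed with plain NumPy by regressing each column on the others; the `vif` helper below is an illustrative sketch (statsmodels also ships a ready-made `variance_inflation_factor`):

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R_j^2),
    where R_j^2 comes from regressing column j on the remaining columns."""
    m, d = X.shape
    vifs = np.empty(d)
    for j in range(d):
        others = np.c_[np.delete(X, j, axis=1), np.ones(m)]  # include intercept
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(7)
m = 500
x1 = rng.standard_normal(m)
x2 = x1 + 0.1 * rng.standard_normal(m)    # nearly collinear with x1
x3 = rng.standard_normal(m)               # independent feature
X = np.c_[x1, x2, x3]
vifs = vif(X)
```

The two nearly collinear columns get large VIFs, while the independent column stays near 1.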

Effects:

  1. Numerical instability: $X^\top X$ is near-singular, causing large errors in $(X^\top X)^{-1}$

  2. High parameter variance: Standard errors increase, confidence intervals widen

  3. Uninterpretable parameters: Weight signs may be opposite to expectations

Solutions:

  1. Remove redundant features: Identify and remove via VIF
  2. PCA dimensionality reduction: Transform correlated features into orthogonal principal components
  3. Ridge regression: The L2 penalty mitigates collinearity
  4. Collect more data: Larger sample size reduces estimation variance

Q7: What are the assumptions of linear regression? What if violated?

Four Classic Assumptions:

Assumption 1: Linearity: $\mathbb{E}[y \mid \mathbf{x}] = \mathbf{w}^\top \mathbf{x}$.

Test: Residual plot should show no obvious pattern.

When Violated:

  • Add polynomial features: $x^2, x^3, \ldots$
  • Feature transformation: $\log x$, $\sqrt{x}$, $1/x$
  • Use nonlinear models: Decision trees, neural networks

Assumption 2: Independence: the noise terms $\epsilon_i$ are mutually independent.

Test: Durbin-Watson test statistic; $DW \approx 2$ indicates no autocorrelation.

When Violated (common in time series):

  • Use autoregressive models (AR, ARIMA)
  • Add lagged terms as features

Assumption 3: Homoscedasticity: $\operatorname{Var}(\epsilon_i) = \sigma^2$ (constant).

Test: In the residual plot, the spread of the residuals should remain constant as $\hat{y}$ varies.

When Violated (heteroscedasticity):

  • Weighted Least Squares (WLS): Different weights for different samples
  • Robust Standard Errors
  • Log-transform target variable

Assumption 4: Normality: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Tests:

  • QQ plot (Quantile-Quantile Plot)
  • Shapiro-Wilk test

When Violated:

  • Transform target variable (Box-Cox transformation)
  • Use nonparametric methods or generalized linear models

Q8: How to handle categorical features?

Problem: Linear regression requires numerical inputs, but many features are categorical (e.g., "color": red, green, blue).

Wrong Approach: Directly encode as integers (red=1, green=2, blue=3).

Problem: Introduces spurious ordinal relationship (model thinks blue is "larger" than red by 2).

Correct Methods:

One-Hot Encoding:

Convert a $k$-category feature into $k$ binary features.

Example:

| Original | Red | Green | Blue |
| --- | --- | --- | --- |
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
| Blue | 0 | 0 | 1 |

Code:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Method 1: pandas
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
encoded = pd.get_dummies(data['color'], prefix='color')
print(encoded)

# Method 2: sklearn
encoder = OneHotEncoder(sparse=False, drop='first') # drop='first' avoids collinearity
X_encoded = encoder.fit_transform(data[['color']])

Notes:

  • Multicollinearity: The $k$ one-hot columns are perfectly collinear (they sum to 1). Solution: Drop one column (drop='first') or use Ridge regression.
  • High cardinality features: If $k$ is large (e.g., zip codes), consider:
    • Target Encoding
    • Embeddings

Q9: Can linear regression handle nonlinear relationships?

Answer: Yes, through feature engineering.

Key Insight: "Linear" in linear regression refers to linearity in the parameters, not in the features. For example:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 \log x$$

This is still a linear model with respect to $\mathbf{w}$, solvable with linear regression.

Method 1: Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Generate polynomial features (degree 3)
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Example: [x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3]

Method 2: Interaction Features

from sklearn.preprocessing import PolynomialFeatures

# Only interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)

# Example: [x1, x2, x3] -> [x1, x2, x3, x1*x2, x1*x3, x2*x3]

Method 3: Custom Transformations

import numpy as np

# Add log, square root, and other nonlinear transformations as new features
X_transformed = np.c_[
    X,
    np.log(X + 1),   # Logarithm
    np.sqrt(X),      # Square root
    X ** 2,          # Square
    np.sin(X)        # Trigonometric
]

Caution:

  • Overfitting risk: The feature count grows rapidly (a degree-$p$ polynomial in $d$ features has $\binom{d+p}{p}$ terms)
  • Regularization necessary: Use Ridge or Lasso to control complexity
  • Reduced interpretability: High-degree terms are hard to interpret

Q10: How to interpret linear regression coefficients?

Basic Interpretation:

For the model:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$

Meaning of $w_j$: Holding other features constant, when $x_j$ increases by 1 unit, $y$ changes by $w_j$ units on average.

Caveats:

  1. Standardization effect: If features are not standardized, weight magnitudes cannot be directly compared to judge importance.

  2. Collinearity effect: Multicollinearity makes weights unstable.

  3. Causal relationship: Correlation ≠ causation. Weights only indicate statistical association, not causal inference.

Standardized Coefficients:

Fitting on standardized data yields coefficients whose relative importance can be compared:

$$w_j^{\text{std}} = w_j \cdot \frac{\sigma_{x_j}}{\sigma_y}$$

Interpretation: When $x_j$ increases by 1 standard deviation, $y$ changes by $w_j^{\text{std}}$ standard deviations.

Significance Testing:

Use a t-test to determine whether $w_j$ is significantly non-zero:

$$t_j = \frac{\hat{w}_j}{\text{SE}(\hat{w}_j)}$$

where $\text{SE}(\hat{w}_j)$ is the standard error. If $|t_j|$ is large (p-value small), reject the null hypothesis $w_j = 0$.

Python Implementation:

import statsmodels.api as sm

# Add constant term
X_with_const = sm.add_constant(X)

# Fit model
model = sm.OLS(y, X_with_const).fit()

# Print summary
print(model.summary())

# Includes:
# - Coefficient estimates
# - Standard errors
# - t-statistics
# - p-values
# - Confidence intervals

Confidence Interval:

95% confidence interval: $\hat{w}_j \pm 1.96 \cdot \text{SE}(\hat{w}_j)$

If the interval does not contain 0, the parameter is significant.

Summary and Outlook

Key Points:

  1. Three perspectives: Algebraic (normal equation), geometric (projection), probabilistic (MLE) all lead to same result

  2. Least squares solution: $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$

  3. Regularization: Ridge (L2) ensures invertibility, Lasso (L1) achieves sparsity

  4. Optimization algorithms: Normal equation for small data, gradient descent for large data

  5. Model diagnostics: Check linearity, independence, homoscedasticity, normality

Practical Tips:

  • Feature standardization
  • Handle collinearity
  • Select regularization parameter
  • Cross-validation for evaluation

Next Chapter Preview: Chapter 6 will explore logistic regression and classification, extending linear models to discrete output spaces. We will derive the origin of the sigmoid function, mathematical foundations of cross-entropy loss, and geometric interpretation of decision boundaries.

✏️ Exercises and Solutions

Exercise 1: Normal Equation Derivation

Problem: Derive the normal equation from matrix calculus.

Solution:

Loss:

$$L(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2$$

Gradient:

$$\nabla_{\mathbf{w}} L = X^\top (X\mathbf{w} - \mathbf{y})$$

Setting the gradient to zero: $X^\top X \mathbf{w} = X^\top \mathbf{y}$, so $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ when $X^\top X$ is invertible.

Exercise 2: Bayesian Interpretation of Ridge

Problem: Prove that Ridge regression is equivalent to MAP estimation with Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$, and find the relationship between $\lambda$ and $\tau^2$.

Solution:

MAP minimizes:

$$\frac{1}{2\sigma^2}\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|^2$$

Comparing with Ridge:

$$\frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

Therefore $\lambda = \sigma^2/\tau^2$. A larger prior variance $\tau^2$ means weaker regularization.

Exercise 3: Projection Matrix Properties

Problem: Prove that $P = X(X^\top X)^{-1}X^\top$ is symmetric and idempotent ($P^\top = P$, $P^2 = P$).

Solution:

Symmetry: $P^\top = X\big((X^\top X)^{-1}\big)^\top X^\top = X(X^\top X)^{-1}X^\top = P$ (since $X^\top X$ is symmetric).

Idempotency: $P^2 = X(X^\top X)^{-1}X^\top X (X^\top X)^{-1}X^\top = X(X^\top X)^{-1}X^\top = P$.

Geometric meaning: $P$ orthogonally projects any vector onto the column space of $X$. Projecting twice is the same as projecting once.

Exercise 4: MLE of Noise Variance

Problem: Under $y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i$ ($\epsilon_i \sim \mathcal{N}(0, \sigma^2)$), derive the MLE of $\sigma^2$ and show it is biased.

Solution:

Log-likelihood:

$$\ell(\sigma^2) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\mathbf{y} - X\hat{\mathbf{w}}\|^2$$

Setting $\partial \ell / \partial \sigma^2 = 0$:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{m}\|\mathbf{y} - X\hat{\mathbf{w}}\|^2$$

This is biased: $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{m - d}{m}\sigma^2$. Unbiased estimator: $\hat{\sigma}^2 = \frac{1}{m - d}\|\mathbf{y} - X\hat{\mathbf{w}}\|^2$.

Exercise 5: Multicollinearity and Ridge

Problem: When $\operatorname{corr}(x_1, x_2) \approx 0.99$, analyze how the OLS variance is affected and why Ridge helps.

Solution:

$\operatorname{Var}(\hat{\mathbf{w}}_{\text{OLS}}) = \sigma^2 (X^\top X)^{-1}$. High correlation makes $X^\top X$ near-singular, inflating $(X^\top X)^{-1}$.

With $R_1^2 \approx 0.98$, $\text{VIF} = \frac{1}{1 - R_1^2} \approx 50$, amplifying the variance 50x.

Ridge: $\operatorname{Var}(\hat{\mathbf{w}}_{\text{ridge}}) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1}$. Adding $\lambda I$ stabilizes the inversion, reducing variance at the cost of small bias, a favorable MSE trade-off.

References

  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.

  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

  3. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

  4. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

  5. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  6. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.

  7. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.

  8. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

  9. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.

  10. Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73-101.

  • Post title: Machine Learning Mathematical Derivations (5): Linear Regression
  • Post author: Chen Kai
  • Create time: 2021-09-18 09:00:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-5-Linear-Regression/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.