In 1886, Francis Galton discovered a peculiar phenomenon while
studying the relationship between parent and child heights: children of
extremely tall or short parents tended to have heights closer to the
average. He coined the term "regression toward the mean," which is the
origin of the word "regression." However, the true power of linear
regression lies not in statistical description, but in serving as a
mathematical foundation for much of machine learning — models from
neural networks to support vector machines can be viewed as
generalizations of linear regression.
The essence of linear regression is finding the optimal hyperplane in
the data space. This seemingly simple problem conceals deep connections
among linear algebra, probability theory, and optimization theory. This
chapter provides a complete mathematical derivation of linear regression
from multiple perspectives.
Objective: Find a parameter vector $\mathbf{w} \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ such that the linear model

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

best fits the training data.

Notation Simplification: To unify representation, we absorb the bias into the weight vector. Define the augmented feature vector $\tilde{\mathbf{x}} = [x_1, \dots, x_d, 1]^\top$ and augmented weight vector $\tilde{\mathbf{w}} = [w_1, \dots, w_d, b]^\top$. The model simplifies to

$$\hat{y} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}.$$

For brevity, we omit the tilde and write $\hat{y} = \mathbf{w}^\top \mathbf{x}$, understanding that $\mathbf{x}$ includes the constant term 1.
Matrix Form
Organize all $m$ training samples into matrix form:
Design Matrix: $X \in \mathbb{R}^{m \times d}$, whose $i$-th row is the feature vector $\mathbf{x}_i^\top$.
Output Vector: $\mathbf{y} = [y_1, \dots, y_m]^\top \in \mathbb{R}^m$.
Prediction Vector: $\hat{\mathbf{y}} = X\mathbf{w}$.
Our goal is to find the optimal $\mathbf{w}$ such that the prediction vector $\hat{\mathbf{y}}$ is as close as possible to the true value $\mathbf{y}$.
Least Squares Method: Algebraic Derivation
Loss Function
Use squared loss (L2 loss) to measure prediction error:

$$L(\mathbf{w}) = \frac{1}{2} \|X\mathbf{w} - \mathbf{y}\|^2 = \frac{1}{2} \sum_{i=1}^{m} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2.$$

The coefficient $\frac{1}{2}$ is for canceling constants during differentiation.
Objective: $\min_{\mathbf{w}} L(\mathbf{w})$.
Gradient Derivation
Compute the gradient of the loss function with respect to $\mathbf{w}$. Expand:

$$L(\mathbf{w}) = \frac{1}{2}(X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y}) = \frac{1}{2}\left( \mathbf{w}^\top X^\top X \mathbf{w} - 2\mathbf{w}^\top X^\top \mathbf{y} + \mathbf{y}^\top \mathbf{y} \right).$$

Differentiate with respect to $\mathbf{w}$ (using the matrix calculus identities $\nabla_{\mathbf{w}}\, \mathbf{w}^\top A \mathbf{w} = 2A\mathbf{w}$ for symmetric $A$, and $\nabla_{\mathbf{w}}\, \mathbf{w}^\top \mathbf{b} = \mathbf{b}$). Therefore:

$$\nabla_{\mathbf{w}} L = X^\top X \mathbf{w} - X^\top \mathbf{y} = X^\top (X\mathbf{w} - \mathbf{y}).$$
Normal Equation
Set the gradient to zero:

$$X^\top X \mathbf{w} = X^\top \mathbf{y}.$$

This is the famous Normal Equation.
Theorem 1 (Least Squares Solution): If $X^\top X$ is invertible, the unique solution to the least squares problem is

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}.$$

Proof:
First-order necessary condition: $\nabla_{\mathbf{w}} L = \mathbf{0}$ gives $X^\top X \mathbf{w} = X^\top \mathbf{y}$.
Second-order sufficient condition: Compute the Hessian matrix $H = X^\top X$. For any non-zero vector $\mathbf{v}$:

$$\mathbf{v}^\top X^\top X \mathbf{v} = \|X\mathbf{v}\|^2 \ge 0.$$

If $X$ has full column rank (i.e., $\operatorname{rank}(X) = d$), then $\|X\mathbf{v}\| = 0$ if and only if $\mathbf{v} = \mathbf{0}$, thus $H$ is positive definite.
Positive definite Hessian + zero gradient = global minimum. QED.
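The normal equation is easy to verify numerically. Below is a minimal sketch on synthetic data (the weights, sample size, and noise level are made up for illustration): `np.linalg.solve` solves the linear system $X^\top X \mathbf{w} = X^\top \mathbf{y}$ without forming an explicit inverse, and NumPy's least squares routine `np.linalg.lstsq` should recover the same solution.

```python
import numpy as np

# Hypothetical synthetic data: 100 samples, 3 features plus a constant column
rng = np.random.default_rng(0)
X = np.c_[rng.normal(size=(100, 3)), np.ones(100)]
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Solve the normal equation X^T X w = X^T y directly;
# np.linalg.solve is preferred over computing an explicit inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem (more stably, via SVD)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With such small noise, both solutions agree with each other and lie close to the true weights.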
Invertibility Condition for $X^\top X$:
Necessary and sufficient condition: $X$ has full column rank, i.e., $\operatorname{rank}(X) = d$
Equivalent condition: $X^\top X$ is positive definite
Practical meaning:
Number of samples $m \ge d$ (samples must be at least as numerous as features)
Features are linearly independent (no perfect collinearity)
Moore-Penrose Pseudoinverse
When $X^\top X$ is not invertible (e.g., $m < d$ or collinear features), use the pseudoinverse:

$$\mathbf{w}^* = X^+ \mathbf{y},$$

where $X^+$ is the Moore-Penrose pseudoinverse.
Properties:
- When $X^\top X$ is invertible, $X^+ = (X^\top X)^{-1} X^\top$ (reduces to the ordinary solution)
- $X^+ \mathbf{y}$ is the minimum norm solution among all solutions satisfying $X^\top X \mathbf{w} = X^\top \mathbf{y}$
Computation Method: Through Singular Value Decomposition (SVD). Let $X = U \Sigma V^\top$, then:

$$X^+ = V \Sigma^+ U^\top,$$

where $\Sigma^+$ is the pseudoinverse of $\Sigma$ (invert non-zero singular values, keep zeros as zeros).
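The SVD recipe above can be sketched directly in NumPy (the rank-deficient design matrix below is a made-up example with a perfectly collinear second column, so $X^\top X$ is singular):

```python
import numpy as np

# Rank-deficient design: second column is twice the first (perfect collinearity)
x1 = np.arange(1.0, 6.0)
X = np.c_[x1, 2 * x1]             # shape (5, 2), rank 1
y = 3 * x1                        # consistent targets: any w with w1 + 2*w2 = 3 fits

# Moore-Penrose pseudoinverse via SVD (the same computation np.linalg.pinv performs)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
s_inv = np.where(s > 1e-10, 1 / s, 0.0)   # invert non-zero singular values only
X_pinv = Vt.T @ np.diag(s_inv) @ U.T

w_min_norm = X_pinv @ y           # minimum-norm least squares solution
```

Among all exact solutions $w_1 + 2w_2 = 3$, the pseudoinverse picks the one with smallest norm, $(0.6, 1.2)$.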
Geometric Interpretation: Projection Perspective
Column Space and Projection
The geometric essence of linear regression is orthogonal projection.
Definitions:
- $\operatorname{col}(X) = \{X\mathbf{w} : \mathbf{w} \in \mathbb{R}^d\}$ is the column space of $X$, the subspace spanned by the columns of $X$
- $\mathbf{y} \in \mathbb{R}^m$ is a vector in $m$-dimensional space
- The goal is to find the point in $\operatorname{col}(X)$ closest to $\mathbf{y}$
Theorem 2 (Orthogonal Projection Theorem): $\hat{\mathbf{y}} = X\mathbf{w}^*$ is the orthogonal projection of $\mathbf{y}$ onto $\operatorname{col}(X)$ if and only if the residual vector $\mathbf{y} - X\mathbf{w}^*$ is orthogonal to $\operatorname{col}(X)$:

$$X^\top (\mathbf{y} - X\mathbf{w}^*) = \mathbf{0}.$$

This is precisely the normal equation!
Proof:
For any $\mathbf{w}$, consider the squared norm of the prediction error:

$$\|\mathbf{y} - X\mathbf{w}\|^2 = \|(\mathbf{y} - X\mathbf{w}^*) + (X\mathbf{w}^* - X\mathbf{w})\|^2.$$

Let $\mathbf{r} = \mathbf{y} - X\mathbf{w}^*$ and $\mathbf{u} = X\mathbf{w}^* - X\mathbf{w} \in \operatorname{col}(X)$. If $X^\top \mathbf{r} = \mathbf{0}$, then $\mathbf{r} \perp \mathbf{u}$, and using the Pythagorean theorem (valid when two vectors are orthogonal):

$$\|\mathbf{y} - X\mathbf{w}\|^2 = \|\mathbf{r}\|^2 + \|\mathbf{u}\|^2 \ge \|\mathbf{r}\|^2 = \|\mathbf{y} - X\mathbf{w}^*\|^2.$$

Equality holds if and only if $X\mathbf{w} = X\mathbf{w}^*$. QED.
Projection Matrix
Definition: The projection matrix

$$P = X(X^\top X)^{-1} X^\top$$

projects any vector onto $\operatorname{col}(X)$:
- $P^2 = P$ (idempotent)
- $P^\top = P$ (symmetric)
- $X^\top (I - P) = 0$ (residuals orthogonal to the column space)
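All three properties can be checked numerically on a small made-up design matrix (random full-column-rank $X$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))               # full column rank with probability 1
P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection onto col(X)

y = rng.normal(size=6)
y_hat = P @ y                             # projection ("shadow") of y on col(X)
residual = y - y_hat                      # perpendicular component
```

The checks below confirm idempotence, symmetry, and orthogonality of the residual to the column space.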
Geometric Intuition
In $m$-dimensional space:
- $\mathbf{y}$ is the true output vector
- $\operatorname{col}(X)$ is a $d$-dimensional subspace (if $X$ has full column rank)
- $\hat{\mathbf{y}} = P\mathbf{y}$ is the "shadow" of $\mathbf{y}$ on $\operatorname{col}(X)$
- $\mathbf{y} - \hat{\mathbf{y}}$ is the residual, perpendicular to $\operatorname{col}(X)$
Analogy (projection onto a plane): Imagine dropping a perpendicular in 3D space from a point $\mathbf{y}$ to a 2D plane. The projection point — the foot of the perpendicular — minimizes the distance.
Probabilistic Perspective: Maximum Likelihood Estimation
Linear Gaussian Model
Assume the data generation process is:

$$y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),$$

where the $\epsilon_i$ are independent and identically distributed Gaussian noise.
Equivalent form:

$$p(y_i \mid \mathbf{x}_i; \mathbf{w}, \sigma^2) = \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2).$$

That is, given $\mathbf{x}_i$, $y_i$ follows a Gaussian distribution with mean $\mathbf{w}^\top \mathbf{x}_i$ and variance $\sigma^2$.
Likelihood Function
Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^m$, the likelihood function of the parameters $(\mathbf{w}, \sigma^2)$ is:

$$\mathcal{L}(\mathbf{w}, \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2} \right).$$

Log-Likelihood
Take the logarithm (a monotonic transformation doesn't change the maximizer):

$$\ell(\mathbf{w}, \sigma^2) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2.$$

Maximum Likelihood Estimation
Optimization with respect to $\mathbf{w}$:
Maximizing $\ell(\mathbf{w}, \sigma^2)$ is equivalent to minimizing:

$$\sum_{i=1}^{m} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2.$$

This is precisely the least squares objective!
Theorem 3: Under the linear Gaussian model, maximum likelihood estimation is equivalent to least squares estimation:

$$\hat{\mathbf{w}}_{\text{MLE}} = \hat{\mathbf{w}}_{\text{LS}} = (X^\top X)^{-1} X^\top \mathbf{y}.$$

Optimization with respect to $\sigma^2$:
Fixing $\mathbf{w} = \hat{\mathbf{w}}$, differentiate $\ell$ with respect to $\sigma^2$ and set the derivative to zero. Solving:

$$\hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{\mathbf{w}}^\top \mathbf{x}_i)^2.$$

That is, the maximum likelihood estimate of the noise variance is the mean of the residual sum of squares.
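A quick sanity check of this formula, on made-up data with known noise level (all constants below are hypothetical): fit $\hat{\mathbf{w}}$ by least squares, then the mean squared residual should approximately recover the true $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5000
X = np.c_[rng.normal(size=(m, 2)), np.ones(m)]   # two features plus constant column
w_true = np.array([1.0, -2.0, 0.5])
sigma_true = 0.3
y = X @ w_true + sigma_true * rng.normal(size=m)

# Least squares fit, then the MLE of the noise variance
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ w_hat
sigma2_mle = np.mean(residuals ** 2)             # mean residual sum of squares
```

With $m = 5000$ samples, `sigma2_mle` should land very close to $\sigma^2 = 0.09$.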
Bayesian Perspective
Introduce a prior distribution $p(\mathbf{w})$ for the parameters, and compute the posterior through Bayes' rule:

$$p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w}) \, p(\mathbf{w}).$$

Gaussian Prior: Assume $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$. Then maximizing the posterior probability (MAP) is equivalent to minimizing:

$$\|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}.$$

This is precisely the objective function of Ridge Regression! The regularization term $\lambda \|\mathbf{w}\|^2$ reflects the strength of the prior belief.
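The MAP-ridge correspondence can be checked numerically: compute the ridge closed-form solution with $\lambda = \sigma^2/\tau^2$ and verify that it minimizes the (negative log) posterior objective. The data and the values of $\sigma^2$, $\tau^2$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2                      # prior strength appears as the ridge parameter

# MAP estimate = ridge solution (X^T X + lam*I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

def neg_log_posterior(w):
    # up to additive constants: ||Xw - y||^2 + lam * ||w||^2
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

obj_at_map = neg_log_posterior(w_map)    # should be the global minimum
```

Since the objective is strictly convex, any perturbation of `w_map` can only increase it.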
Regularization: Ridge and Lasso Regression
Ridge Regression (L2 Regularization)
Objective Function:

$$L(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2,$$

where $\lambda > 0$ is the regularization parameter.
Gradient:

$$\nabla_{\mathbf{w}} L = 2X^\top (X\mathbf{w} - \mathbf{y}) + 2\lambda \mathbf{w}.$$

Set the gradient to zero:

$$(X^\top X + \lambda I)\mathbf{w} = X^\top \mathbf{y}.$$

Analytical Solution:

$$\mathbf{w}^*_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}.$$

Key Observations:
Adding $\lambda I$ ensures invertibility of $X^\top X + \lambda I$ (even when $X^\top X$ is not invertible)
When $\lambda \to 0$, the solution reduces to ordinary least squares
When $\lambda \to \infty$, $\mathbf{w}^* \to \mathbf{0}$ (extreme regularization)
Matrix Perspective: Ridge regression "stabilizes" the matrix $X^\top X$ by adding a diagonal term, avoiding ill-conditioned problems.
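The stabilizing effect is easy to see on a deliberately degenerate example (made up for illustration): with perfectly collinear columns, $X^\top X$ is singular and ordinary least squares has no unique solution, but $X^\top X + \lambda I$ is positive definite for any $\lambda > 0$.

```python
import numpy as np

# Collinear features: the second column is exactly twice the first
x1 = np.arange(1.0, 7.0)
X = np.c_[x1, 2 * x1]                        # X^T X has rank 1 (singular)
y = x1 + 0.1 * np.random.default_rng(5).normal(size=6)

lam = 0.1
A = X.T @ X + lam * np.eye(2)                # adding lam*I makes A positive definite
w_ridge = np.linalg.solve(A, X.T @ y)        # unique, finite solution
```

The singular matrix becomes invertible with even a tiny diagonal "safety cushion".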
Lasso Regression (L1 Regularization)
Objective Function:

$$L(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|_1,$$

where $\|\mathbf{w}\|_1 = \sum_j |w_j|$ is the L1 norm.
Characteristics:
No analytical solution (the L1 norm is non-differentiable at 0)
Sparsity: Some parameters are compressed exactly to 0, achieving feature selection
Geometric Interpretation:
In the constrained form:

$$\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|^2 \quad \text{s.t.} \quad \|\mathbf{w}\|_1 \le t,$$

the L1 constraint ball is diamond-shaped (a cross-polytope in high dimensions), whose sharp corners are more likely to intersect the loss contours on the coordinate axes, leading to sparse solutions.
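The sparsity effect can be demonstrated with scikit-learn's `Lasso` on synthetic data (the data, true coefficients, and the choice `alpha=0.5` below are illustrative assumptions, not values from the text):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))   # coefficients driven *exactly* to zero
```

The irrelevant coefficients are not merely small — they are exactly zero, which is the feature-selection behavior an L2 penalty cannot produce.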
Elastic Net
Combines L1 and L2:

$$L(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|^2.$$

Advantages:
Retains L1 sparsity
Retains L2 stability (friendly to collinear features)
Effect of Regularization
Bias-Variance Tradeoff:
No regularization ($\lambda = 0$): Low bias, high variance (overfitting)
Strong regularization (large $\lambda$): High bias, low variance (underfitting)
Compromise (Mini-batch Gradient Descent): Between batch gradient descent (all $m$ samples per step) and stochastic gradient descent (one sample per step), use $B$ samples each time (the batch size):

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_{\mathbf{w}} \ell_i(\mathbf{w}),$$

where $\mathcal{B}_t$ is the batch for iteration $t$ (of size $B$).
Typical Choice: $B \in \{32, 64, 128, 256\}$.
Convergence Analysis
Theorem 4 (BGD Convergence): For a learning rate $0 < \alpha < 2/\lambda_{\max}$, batch gradient descent converges linearly to the optimal solution:

$$\|\mathbf{w}_t - \mathbf{w}^*\| \le \rho^t \, \|\mathbf{w}_0 - \mathbf{w}^*\|, \qquad \rho = \max\left(|1 - \alpha\lambda_{\min}|,\; |1 - \alpha\lambda_{\max}|\right) < 1,$$

where $\lambda_{\min}$, $\lambda_{\max}$ are the minimum and maximum eigenvalues of $X^\top X$.
Proof Sketch:
The loss function is strongly convex (positive definite Hessian)
The gradient is Lipschitz continuous
Apply the convergence theorem for strongly convex functions
Practical Recommendations:
Learning rate: $\alpha = 1/\lambda_{\max}$ (conservative choice)
Adaptive learning rates: Adam, RMSprop, etc.
Learning rate decay, e.g. $\alpha_t = \alpha_0 / (1 + kt)$
Model Evaluation and Selection
Evaluation Metrics
Mean Squared Error (MSE): $\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$
Root Mean Squared Error (RMSE): $\text{RMSE} = \sqrt{\text{MSE}}$
Mean Absolute Error (MAE): $\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$
Coefficient of Determination (R²):

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where $\bar{y}$ is the mean of the $y_i$.
Interpretation:
- $R^2 = 1$: Perfect fit
- $R^2 = 0$: Model equivalent to predicting the mean
- $R^2 < 0$: Model worse than the mean (rare, indicates model issues)
Adjusted R²
Considers a model complexity penalty:

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(m - 1)}{m - d - 1}.$$

Advantage: When adding useless features, $\bar{R}^2$ decreases (while $R^2$ never decreases).
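The metrics above are one-liners in NumPy. A minimal sketch on made-up numbers (the toy arrays are chosen so the results can be checked by hand):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2, m, d):
    # penalize model complexity: m samples, d features
    return 1 - (1 - r2) * (m - 1) / (m - d - 1)

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

mse = np.mean((y_true - y_pred) ** 2)    # 0.025
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))   # 0.15
r2 = r2_score(y_true, y_pred)            # 1 - 0.1/20 = 0.995
```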
Cross-Validation
k-Fold Cross-Validation:
Split the data into $k$ folds
For $i = 1, \dots, k$:
- Train on all data except fold $i$
- Test on fold $i$
Average the $k$ test errors
Python Implementation:
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Assumes a feature matrix X and targets y are already defined
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(f"5-fold CV R²: {scores.mean():.4f} ± {scores.std():.4f}")
```
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class LinearRegression:
    """
    Complete implementation of linear regression
    Supports:
    - Analytical solution (normal equation)
    - Batch gradient descent
    - Stochastic gradient descent
    - Ridge regularization
    """

    def __init__(self, method='normal', alpha=0.01, n_iterations=1000,
                 lambda_reg=0.0, batch_size=None, random_state=42):
        """
        Parameters:
            method: str, solving method
                'normal': Normal equation
                'bgd': Batch gradient descent
                'sgd': Stochastic gradient descent
                'mini_batch': Mini-batch gradient descent
            alpha: float, learning rate (for gradient descent only)
            n_iterations: int, number of iterations
            lambda_reg: float, Ridge regularization parameter
            batch_size: int, batch size (for mini_batch only)
            random_state: int, random seed
        """
        self.method = method
        self.alpha = alpha
        self.n_iterations = n_iterations
        self.lambda_reg = lambda_reg
        self.batch_size = batch_size
        self.random_state = random_state
        self.w = None
        self.loss_history = []

    def fit(self, X, y):
        """
        Train linear regression model

        Parameters:
            X: np.array, shape=(m, d), input features
            y: np.array, shape=(m,), output labels
        """
        # Add bias term
        X_bias = self._add_bias(X)
        m, d = X_bias.shape

        # Initialize weights
        np.random.seed(self.random_state)
        self.w = np.random.randn(d) * 0.01

        if self.method == 'normal':
            # Normal equation
            self.w = self._normal_equation(X_bias, y)
        elif self.method == 'bgd':
            # Batch gradient descent
            self._batch_gradient_descent(X_bias, y)
        elif self.method == 'sgd':
            # Stochastic gradient descent
            self._stochastic_gradient_descent(X_bias, y)
        elif self.method == 'mini_batch':
            # Mini-batch gradient descent
            if self.batch_size is None:
                self.batch_size = min(32, m)
            self._mini_batch_gradient_descent(X_bias, y)
        else:
            raise ValueError(f"Unknown method: {self.method}")
        return self

    def predict(self, X):
        """
        Prediction

        Parameters:
            X: np.array, shape=(m, d)
        Returns:
            y_pred: np.array, shape=(m,)
        """
        X_bias = self._add_bias(X)
        return X_bias @ self.w

    def score(self, X, y):
        """Compute R² score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    def _add_bias(self, X):
        """Add bias column"""
        return np.c_[X, np.ones(X.shape[0])]

    def _compute_loss(self, X, y):
        """Compute loss (including regularization)"""
        predictions = X @ self.w
        mse = np.mean((predictions - y) ** 2)
        reg = self.lambda_reg * np.sum(self.w[:-1] ** 2)  # Don't regularize bias
        return mse + reg

    def _compute_gradient(self, X, y):
        """Compute gradient"""
        m = len(y)
        predictions = X @ self.w
        gradient = X.T @ (predictions - y) / m
        # Add regularization gradient (excluding bias)
        if self.lambda_reg > 0:
            reg_gradient = np.zeros_like(self.w)
            reg_gradient[:-1] = 2 * self.lambda_reg * self.w[:-1]
            gradient += reg_gradient
        return gradient

    def _normal_equation(self, X, y):
        """Normal equation solver: w = (X^T X + lambda*I)^{-1} X^T y"""
        d = X.shape[1]
        reg_matrix = self.lambda_reg * np.eye(d)
        reg_matrix[-1, -1] = 0  # Don't regularize bias
        try:
            w = np.linalg.solve(X.T @ X + reg_matrix, X.T @ y)
        except np.linalg.LinAlgError:
            # If matrix is singular, use pseudoinverse
            w = np.linalg.pinv(X.T @ X + reg_matrix) @ X.T @ y
        return w

    def _batch_gradient_descent(self, X, y):
        """Batch gradient descent"""
        for i in range(self.n_iterations):
            gradient = self._compute_gradient(X, y)
            self.w -= self.alpha * gradient
            # Record loss
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _stochastic_gradient_descent(self, X, y):
        """Stochastic gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)
        for i in range(self.n_iterations):
            # Randomly select one sample
            idx = np.random.randint(m)
            X_i = X[idx:idx + 1]
            y_i = y[idx:idx + 1]
            gradient = self._compute_gradient(X_i, y_i)
            self.w -= self.alpha * gradient
            # Record loss (every 10 iterations)
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _mini_batch_gradient_descent(self, X, y):
        """Mini-batch gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)
        for i in range(self.n_iterations):
            # Randomly select a batch
            indices = np.random.choice(m, self.batch_size, replace=False)
            gradient = self._compute_gradient(X[indices], y[indices])
            self.w -= self.alpha * gradient
            # Record loss
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))


# Example: House price prediction
def demo_linear_regression():
    """
    Complete example: Linear regression on housing price data
    """
    # Generate synthetic data (simulating house price prediction)
    np.random.seed(42)
    m = 500
    d = 5

    # True weights (last one is bias)
    w_true = np.array([50, -20, 30, 15, -10, 200])

    # Generate features (standardized)
    X = np.random.randn(m, d)

    # Add bias and compute true values
    X_bias = np.c_[X, np.ones(m)]
    y_true = X_bias @ w_true

    # Add noise
    y = y_true + np.random.randn(m) * 20

    # Split train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Compare different methods
    methods = {
        'Normal Equation': LinearRegression(method='normal'),
        'Batch GD': LinearRegression(method='bgd', alpha=0.1, n_iterations=1000),
        'SGD': LinearRegression(method='sgd', alpha=0.01, n_iterations=2000),
        'Mini-batch GD': LinearRegression(method='mini_batch', alpha=0.05,
                                          n_iterations=1000, batch_size=32)
    }

    results = {}
    print("=" * 70)
    print("Comparison of Linear Regression Methods")
    print("=" * 70)

    for name, model in methods.items():
        # Train
        model.fit(X_train_scaled, y_train)
        # Evaluate
        train_score = model.score(X_train_scaled, y_train)
        test_score = model.score(X_test_scaled, y_test)
        y_pred = model.predict(X_test_scaled)
        rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
        results[name] = {
            'train_r2': train_score,
            'test_r2': test_score,
            'rmse': rmse,
            'weights': model.w
        }
        print(f"\n{name}:")
        print(f"  Train R²: {train_score:.4f}")
        print(f"  Test R²: {test_score:.4f}")
        print(f"  Test RMSE: {rmse:.2f}")

    # Visualize: loss curves and predictions
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Loss curves
    ax = axes[0]
    for name in ['Batch GD', 'SGD', 'Mini-batch GD']:
        model = methods[name]
        if len(model.loss_history) > 0:
            ax.plot(model.loss_history, label=name, alpha=0.7)
    ax.set_xlabel('Iteration (×10)')
    ax.set_ylabel('Loss')
    ax.set_title('Convergence of Gradient Descent Methods')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Predictions vs true values
    ax = axes[1]
    model = methods['Normal Equation']
    y_pred = model.predict(X_test_scaled)
    ax.scatter(y_test, y_pred, alpha=0.5)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
            'r--', lw=2, label='Perfect Prediction')
    ax.set_xlabel('True Values')
    ax.set_ylabel('Predictions')
    ax.set_title('Predictions vs True Values')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('linear_regression_demo.png', dpi=150)
    plt.show()

    # Ridge regression: regularization effect
    print("\n" + "=" * 70)
    print("Ridge Regression: Regularization Parameter Selection")
    print("=" * 70)

    lambdas = [0, 0.01, 0.1, 1, 10, 100]
    train_scores = []
    test_scores = []
    for lam in lambdas:
        model = LinearRegression(method='normal', lambda_reg=lam)
        model.fit(X_train_scaled, y_train)
        train_scores.append(model.score(X_train_scaled, y_train))
        test_scores.append(model.score(X_test_scaled, y_test))
        print(f"λ={lam:6.2f}: Train R²={train_scores[-1]:.4f}, "
              f"Test R²={test_scores[-1]:.4f}")

    # Visualize regularization effect
    plt.figure(figsize=(10, 6))
    plt.plot(lambdas, train_scores, 'o-', label='Training R²', linewidth=2)
    plt.plot(lambdas, test_scores, 's-', label='Testing R²', linewidth=2)
    plt.xscale('log')
    plt.xlabel('Regularization Parameter λ')
    plt.ylabel('R² Score')
    plt.title('Effect of Ridge Regularization')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('ridge_regularization.png', dpi=150)
    plt.show()


if __name__ == "__main__":
    demo_linear_regression()
```
```
======================================================================
Comparison of Linear Regression Methods
======================================================================

Normal Equation:
  Train R²: 0.9452
  Test R²: 0.9387
  Test RMSE: 19.87

Batch GD:
  Train R²: 0.9452
  Test R²: 0.9387
  Test RMSE: 19.87

SGD:
  Train R²: 0.9449
  Test R²: 0.9385
  Test RMSE: 19.91

Mini-batch GD:
  Train R²: 0.9451
  Test R²: 0.9386
  Test RMSE: 19.88

======================================================================
Ridge Regression: Regularization Parameter Selection
======================================================================
λ=  0.00: Train R²=0.9452, Test R²=0.9387
λ=  0.01: Train R²=0.9452, Test R²=0.9387
λ=  0.10: Train R²=0.9451, Test R²=0.9388
λ=  1.00: Train R²=0.9442, Test R²=0.9391
λ= 10.00: Train R²=0.9361, Test R²=0.9372
λ=100.00: Train R²=0.8647, Test R²=0.8723
```
Q&A: Core Questions
on Linear Regression
Q1: Why use
squared loss instead of absolute loss?
Mathematical Reasons:
Differentiability: $(\hat{y} - y)^2$ is differentiable everywhere, convenient for optimization
Analytical Solution: Squared loss leads to quadratic optimization with a closed-form solution
Statistical Meaning: Under the Gaussian noise assumption, squared loss corresponds to maximum likelihood estimation
Absolute Loss (L1 loss):
- Advantage: Robust to outliers
- Disadvantage: Non-differentiable at 0, no analytical solution, requires linear programming
Comparison:

| Loss Function | Differentiability | Analytical Solution | Outlier Robustness | Noise Distribution |
|---|---|---|---|---|
| Squared (L2) | Everywhere | Yes | Weak | Gaussian |
| Absolute (L1) | Not at 0 | No | Strong | Laplace |
| Huber | Everywhere | No | Medium | Mixed |
Huber Loss (Compromise):

$$L_\delta(r) = \begin{cases} \dfrac{1}{2} r^2, & |r| \le \delta \\[4pt] \delta |r| - \dfrac{1}{2}\delta^2, & |r| > \delta \end{cases}$$

Quadratic when $|r| \le \delta$, linear when $|r| > \delta$, combining the advantages of both.
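The piecewise definition above translates directly into a vectorized function (a minimal sketch; the default $\delta = 1$ is an arbitrary choice):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    # quadratic for small residuals, linear for large ones
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))
```

For example, a residual of 0.5 falls in the quadratic branch (0.125), while a residual of 3 falls in the linear branch (2.5).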
Q2: Normal equation vs gradient descent — when to use which?
Decision Tree:

```
Data scale?
├─ Small (m < 10000, d < 1000)
│  └─ Use normal equation (fast, accurate)
└─ Large (m > 10000 or d > 1000)
   ├─ Is X^T X invertible?
   │  ├─ Yes → Gradient descent
   │  └─ No  → Ridge regression or gradient descent
   └─ Enough memory?
      ├─ Yes → Batch gradient descent
      └─ No  → SGD or Mini-batch GD
```
Detailed Comparison:

| Dimension | Normal Equation | Gradient Descent |
|---|---|---|
| Time Complexity | $O(md^2 + d^3)$ | $O(kmd)$ ($k$ iterations) |
| Space Complexity | $O(d^2)$ | $O(d)$ |
| Convergence | One step | Multiple iterations |
| Hyperparameters | None | Learning rate, iterations |
| Feature Count | Moderate $d$ feasible | Arbitrarily large |
| Invertibility | Requires $X^\top X$ invertible | No requirement |
| Regularization | Easy to add | Easy to add |
| Online Learning | Not supported | Supported (SGD) |
Practical Recommendations:
Default: For small $m$ and $d$, use the normal equation; otherwise gradient descent
Big Data: Use SGD or Mini-batch GD
Real-time Updates: Use SGD (supports online learning)
Q3: Why does Ridge regression always have a solution?
Core Reason: Adding $\lambda I$ (with $\lambda > 0$) makes the matrix positive definite.
Theorem: For any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is positive definite, thus invertible.
Proof:
For any non-zero vector $\mathbf{v}$:

$$\mathbf{v}^\top (X^\top X + \lambda I)\mathbf{v} = \|X\mathbf{v}\|^2 + \lambda \|\mathbf{v}\|^2.$$

Since $\|X\mathbf{v}\|^2 \ge 0$ and $\lambda \|\mathbf{v}\|^2 > 0$, the sum is strictly positive. Even if $X\mathbf{v} = \mathbf{0}$ (i.e., $\mathbf{v}$ lies in the null space of $X$), since $\mathbf{v} \ne \mathbf{0}$:

$$\mathbf{v}^\top (X^\top X + \lambda I)\mathbf{v} = \lambda \|\mathbf{v}\|^2 > 0.$$

Therefore $X^\top X + \lambda I$ is positive definite, hence invertible. QED.
Intuition:
- $X^\top X$ might be singular (e.g., collinear features)
- Adding $\lambda I$ is like adding a "safety cushion" on the diagonal
- Even if some features are completely correlated, $\lambda I$ ensures positive definiteness
Geometric Meaning:
In feature space, the null space of $X$ corresponds to directions that cannot be determined from the data. Ridge regression applies a "preference for 0" prior in these directions by adding $\lambda \|\mathbf{w}\|^2$, making the problem well-posed.
Q4: How to choose the regularization parameter $\lambda$?
Based on the bias-variance tradeoff, a classical heuristic (in the spirit of Hoerl-Kennard) is

$$\lambda^* \approx \frac{d\,\sigma^2}{\|\mathbf{w}^*\|^2},$$

where $\sigma^2$ is the noise variance and $\mathbf{w}^*$ is the true parameter.
Practical Experience:
Starting Point: a small value such as $\lambda \approx 0.1$–$1$ (on standardized data)
Range: search $\lambda$ over several orders of magnitude, e.g. $[10^{-4}, 10^{2}]$ (logarithmic scale)
Fine-tuning: Narrow the range around the optimal value
Early Stopping: Stop if the validation error increases continuously
L-Curve Method:
Plot $\log \|X\mathbf{w}_\lambda - \mathbf{y}\|^2$ vs $\log \|\mathbf{w}_\lambda\|^2$, and select $\lambda$ at the "elbow" of the curve.
Q5: Why is feature standardization important?
Problem: Different feature scales lead to:
Slow Gradient Descent Convergence: The loss contours become elongated ellipsoids, so the gradient direction doesn't point toward the optimum
Unfair Regularization: $\|\mathbf{w}\|^2$ penalizes features unevenly across scales — a small-scale feature needs a larger weight, so it absorbs more of the penalty
Example:
Suppose two features:
- $x_1$: Area (range 0-1000 square meters)
- $x_2$: Number of rooms (range 1-10)
Weights such as $w_1 = 0.1$ and $w_2 = 10$ may have the same impact on the prediction, but $\|\mathbf{w}\|^2$ mainly penalizes $w_2$.
Standardization Methods:
Z-score Standardization (Recommended):

$$x' = \frac{x - \mu}{\sigma},$$

where $\mu$ is the feature mean and $\sigma$ its standard deviation.
Min-Max Standardization:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}.$$
Effect Comparison:

| Method | Mean | Std/Range | Outlier Sensitivity |
|---|---|---|---|
| Raw Data | Arbitrary | Arbitrary | - |
| Z-score | 0 | 1 | Medium |
| Min-Max | - | [0, 1] | High |
Code:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use train mean and std
```
Note:
After standardization, weight interpretation changes. To recover original-scale weights:

$$w_j = \frac{w_j'}{\sigma_j}, \qquad b = b' - \sum_j \frac{w_j' \mu_j}{\sigma_j},$$

where $w_j'$ and $b'$ are the weights learned on standardized features, and $\mu_j$, $\sigma_j$ are the per-feature mean and standard deviation.
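This back-transformation can be verified with scikit-learn, whose `StandardScaler` exposes the fitted means and standard deviations as `mean_` and `scale_`. The data-generating weights below are made-up illustrative values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 5 + 0.01 * rng.normal(size=100)

scaler = StandardScaler()
Xs = scaler.fit_transform(X)
model = LinearRegression().fit(Xs, y)   # sklearn's estimator, fit on standardized data

# Map standardized weights back: w_j = w'_j / sigma_j, b = b' - sum(w'_j mu_j / sigma_j)
w_orig = model.coef_ / scaler.scale_
b_orig = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)
```

The recovered `w_orig` and `b_orig` should match the original-scale generating parameters $(2, -3)$ and $5$.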
Q6: How does multicollinearity affect linear regression?
Definition: Multicollinearity refers to high linear correlation among features.
Detection Methods:
Variance Inflation Factor (VIF):

$$\text{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination when predicting feature $x_j$ from the other features.
Judgment Criteria:
- $\text{VIF} < 5$: No serious collinearity
- $5 \le \text{VIF} \le 10$: Moderate collinearity
- $\text{VIF} > 10$: Severe collinearity
Impact:
Numerical Instability: $X^\top X$ nearly singular, large error in computing $(X^\top X)^{-1}$
Large Parameter Variance: Standard errors of weight estimates increase, confidence intervals widen
Uninterpretable Parameters: Weight signs may contradict expectations
Example:
Suppose $x_2 \approx 2x_1$ (nearly collinear). Then:
True model: $y = x_1 + x_2$
Estimated model might give, for instance, $y = 101x_1 - 49x_2$ (unstable parameters, yet nearly the same predictions, since $101x_1 - 49x_2 \approx 3x_1 \approx x_1 + x_2$)
Solutions:
Remove Redundant Features: Identify and remove via VIF
PCA Dimensionality Reduction: Transform correlated features into orthogonal principal components
Ridge Regression: The $\lambda\|\mathbf{w}\|^2$ penalty mitigates collinearity
Collect More Data: Increase sample size to reduce estimation variance
Python VIF Detection:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute VIF for each feature
vif_data = []
for i in range(X.shape[1]):
    vif = variance_inflation_factor(X, i)
    vif_data.append({'Feature': f'X{i+1}', 'VIF': vif})

df_vif = pd.DataFrame(vif_data)
print(df_vif)

# Flag features with VIF > 10
high_vif = df_vif[df_vif['VIF'] > 10]['Feature'].tolist()
print(f"High VIF features (consider removal): {high_vif}")
```
Q7: What are the assumptions of linear regression? What if violated?
Four Key Assumptions of Classical Linear Regression: linearity, independence of errors, homoscedasticity, and normality of errors.
Assumption 1: Linearity
Check: The residual plot should show no obvious pattern.
When Violated:
Add polynomial features: $x, x^2, x^3, \dots$
Feature transformations: $\log x$, $\sqrt{x}$, $1/x$
Use nonlinear models: decision trees, neural networks
Q8: How to handle categorical features?
Multicollinearity: $K$ one-hot columns are completely linearly dependent (they sum to 1). Solution: drop one column (`drop='first'`) or use Ridge regression.
High-Cardinality Features: If the category count $K$ is large (e.g., ZIP codes), consider:
Target Encoding
Embeddings
Q9: Can linear regression handle nonlinear relationships?
Answer: Yes, through feature engineering.
Key Insight: "Linear" in linear regression refers to linearity in the parameters, not in the features. For example,

$$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

is still a linear model in $\mathbf{w}$, solvable by linear regression.
Method 1: Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Degree-3 polynomial regression: expand features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)  # assumes X, y are already defined
```
Summary
Optimization Algorithms: For small data, use the normal equation; for big data, use gradient descent
Model Diagnostics: Check linearity, independence, homoscedasticity, normality
Practical Points:
Feature standardization
Handle collinearity
Select the regularization parameter $\lambda$
Cross-validation evaluation
Next Chapter Preview: Chapter 6 will explore logistic regression and classification problems, extending linear models to discrete output spaces. We will derive the origin of the sigmoid function, the mathematical foundation of cross-entropy loss, and the geometric meaning of decision boundaries.
References
Hastie, T., Tibshirani, R., & Friedman, J. (2009).
The Elements of Statistical Learning (2nd ed.).
Springer.
Bishop, C. M. (2006). Pattern Recognition and Machine
Learning. Springer.
Murphy, K. P. (2012). Machine Learning: A Probabilistic
Perspective. MIT Press.
Shalev-Shwartz, S., & Ben-David, S. (2014).
Understanding Machine Learning: From Theory to
Algorithms. Cambridge University Press.
Goodfellow, I., Bengio, Y., & Courville, A. (2016).
Deep Learning. MIT Press.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013).
An Introduction to Statistical Learning.
Springer.
Tibshirani, R. (1996). Regression shrinkage and selection
via the lasso. Journal of the Royal Statistical Society:
Series B, 58(1), 267-288.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge
regression: Biased estimation for nonorthogonal problems.
Technometrics, 12(1), 55-67.
Zou, H., & Hastie, T. (2005). Regularization and
variable selection via the elastic net. Journal of the
Royal Statistical Society: Series B, 67(2), 301-320.
Huber, P. J. (1964). Robust estimation of a location
parameter. The Annals of Mathematical Statistics,
35(1), 73-101.
Li, H. (2012). Statistical Learning Methods.
Tsinghua University Press.
Zhou, Z. H. (2016). Machine Learning. Tsinghua
University Press.
Post title: Mathematical Derivation of Machine Learning (5): Linear Regression
Post author: Chen Kai
Create time: 2025-10-25 00:00:00
Post link: https://www.chenk.top/ml-math-05-linear-regression/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless otherwise stated.