Mathematical Derivation of Machine Learning (5): Linear Regression
Chen Kai

In 1886, Francis Galton, studying the relationship between parent and child heights, observed a peculiar phenomenon: children of extremely tall or extremely short parents tended to have heights closer to the average. He coined the term "regression toward the mean," which is the origin of the word "regression." The true power of linear regression, however, lies not in statistical description but in serving as a mathematical foundation for much of machine learning — models from neural networks to support vector machines can be viewed as generalizations of the linear model.

The essence of linear regression is finding the optimal hyperplane in the data space. This seemingly simple problem conceals deep connections among linear algebra, probability theory, and optimization theory. This chapter provides a complete mathematical derivation of linear regression from multiple perspectives.

Basic Form of Linear Regression

Problem Definition

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{m}$, where:

- $x_i \in \mathbb{R}^d$ is a $d$-dimensional input vector (features)
- $y_i \in \mathbb{R}$ is a scalar output (target value)

Objective: Find a parameter vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ such that the linear model $\hat{y} = w^T x + b$ best fits the training data.

Notation Simplification: To unify representation, we absorb the bias into the weight vector. Define the augmented vectors $\tilde{x} = [x^T, 1]^T$ and $\tilde{w} = [w^T, b]^T$. The model simplifies to $\hat{y} = \tilde{w}^T \tilde{x}$. For brevity, we omit the tilde and write $\hat{y} = w^T x$, understanding that $x$ includes the constant term 1.

Matrix Form

Organize all training samples into matrix form:

Design Matrix: $X = [x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m \times d}$, whose $i$-th row is $x_i^T$.

Output Vector: $y = [y_1, y_2, \ldots, y_m]^T \in \mathbb{R}^m$

Prediction Vector: $\hat{y} = Xw$. Our goal is to find the optimal $w$ such that the prediction vector $\hat{y}$ is as close as possible to the true value $y$.
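As a concrete sketch of this notation (with made-up numbers), the following NumPy fragment builds an augmented design matrix with a constant column and computes the prediction vector $\hat{y} = Xw$:

```python
import numpy as np

# Hypothetical toy data: m = 4 samples, d = 2 features
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 11.0])

# Augment with a constant column so the bias becomes part of w
X = np.c_[X_raw, np.ones(len(X_raw))]   # shape (m, d+1)

# Example parameter vector [w1, w2, b]
w = np.array([2.0, 1.0, -1.0])
y_hat = X @ w                           # prediction vector Xw
```

Each entry of `y_hat` is one sample's prediction $w^T x_i + b$; the constant column turns the bias into an ordinary coordinate of $w$.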

Least Squares Method: Algebraic Derivation

Loss Function

Use squared loss (L2 loss) to measure prediction error: $L(w) = \frac{1}{2}\|Xw - y\|^2 = \frac{1}{2}\sum_{i=1}^{m}(w^T x_i - y_i)^2$. The coefficient $\frac{1}{2}$ cancels the factor of 2 produced by differentiation.

Objective: $w^* = \arg\min_{w} L(w)$

Gradient Derivation

Compute the gradient of the loss function with respect to $w$. Expand: $L(w) = \frac{1}{2}\left(w^T X^T X w - 2 w^T X^T y + y^T y\right)$. Differentiate term by term using the matrix calculus formulas $\nabla_w (w^T A w) = 2Aw$ (for symmetric $A$) and $\nabla_w (w^T b) = b$. Therefore: $\nabla_w L = X^T X w - X^T y = X^T(Xw - y)$.

Normal Equation

Set the gradient to zero: $X^T X w = X^T y$. This is the famous Normal Equation.

Theorem 1 (Least Squares Solution): If $X^T X$ is invertible, the unique solution to the least squares problem is $w^* = (X^T X)^{-1} X^T y$.

Proof:

  1. First-order necessary condition: $\nabla_w L = 0$ gives $X^T X w = X^T y$.

  2. Second-order sufficient condition: Compute the Hessian matrix $H = \nabla_w^2 L = X^T X$. For any non-zero vector $v$: $v^T X^T X v = \|Xv\|^2 \ge 0$. If $X$ has full column rank (i.e., $\operatorname{rank}(X) = d$), then $Xv = 0$ if and only if $v = 0$, thus $H$ is positive definite.

  3. Positive definite Hessian + zero gradient = global minimum. QED.

Invertibility Condition for $X^T X$:

  • Necessary and sufficient condition: $X$ has full column rank, i.e., $\operatorname{rank}(X) = d$
  • Equivalent condition: $X^T X$ is positive definite
  • Practical meaning:
    • Number of samples $m \ge d$ (samples must be at least as numerous as features)
    • Features are linearly independent (no perfect collinearity)
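To make Theorem 1 concrete, here is a minimal NumPy sketch on synthetic data of my own construction. It solves the normal equation with `np.linalg.solve` rather than forming $(X^T X)^{-1}$ explicitly, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
# Augmented design matrix: two random features plus a constant column
X = np.c_[rng.standard_normal((m, 2)), np.ones(m)]
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(m)

# Solve the normal equation X^T X w = X^T y directly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise, `w_hat` recovers `w_true` closely; the solution satisfies the normal equation by construction.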

Moore-Penrose Pseudoinverse

When $X^T X$ is not invertible (e.g., $m < d$ or collinear features), use the pseudoinverse: $w^* = X^+ y$, where $X^+$ is the Moore-Penrose pseudoinverse of $X$.

Properties:

  • When $X^T X$ is invertible, $X^+ = (X^T X)^{-1} X^T$ (reduces to the ordinary least squares formula)
  • $w^* = X^+ y$ is the minimum-norm solution among all solutions satisfying $X^T X w = X^T y$

Computation Method: Through Singular Value Decomposition (SVD). Let $X = U \Sigma V^T$, then $X^+ = V \Sigma^+ U^T$, where $\Sigma^+$ is the pseudoinverse of $\Sigma$ (invert non-zero singular values, keep zeros as zeros).
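The SVD recipe can be checked against NumPy's built-in `np.linalg.pinv` on a rank-deficient matrix (a random matrix with a deliberately collinear column, my own toy example; the `1e-10` cutoff for "zero" singular values is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
A = np.c_[A, A[:, 0] + A[:, 1]]   # append a collinear column: A is rank-deficient

# Pseudoinverse via SVD: invert non-zero singular values, keep zeros as zeros
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.where(s > 1e-10, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T
```

The hand-rolled result agrees with `np.linalg.pinv(A)`, which uses the same construction internally.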

Geometric Interpretation: Projection Perspective

Column Space and Projection

The geometric essence of linear regression is orthogonal projection.

Definitions:

- $\operatorname{Col}(X)$ is the column space of $X$, the subspace spanned by the columns of $X$
- $y \in \mathbb{R}^m$ is a vector, generally not in $\operatorname{Col}(X)$
- The goal is to find the point in $\operatorname{Col}(X)$ closest to $y$

Theorem 2 (Orthogonal Projection Theorem): $\hat{y} = Xw$ is the orthogonal projection of $y$ onto $\operatorname{Col}(X)$ if and only if the residual vector $y - Xw$ is orthogonal to $\operatorname{Col}(X)$: $X^T(y - Xw) = 0$. This is precisely the normal equation!

Proof:

For any $w'$, consider the squared norm of the prediction error. Let $a = Xw - Xw'$ and $b = y - Xw$. If $X^T(y - Xw) = 0$, then $a \perp b$ (since $a \in \operatorname{Col}(X)$), and by the Pythagorean theorem: $\|y - Xw'\|^2 = \|a + b\|^2 = \|a\|^2 + \|b\|^2 \ge \|b\|^2 = \|y - Xw\|^2$. Equality holds if and only if $Xw' = Xw$. QED.

Projection Matrix

Definition: The projection matrix $P = X(X^T X)^{-1} X^T$ projects any vector onto $\operatorname{Col}(X)$.

Properties:

  1. Idempotency: $P^2 = P$ (projecting twice equals projecting once)

  2. Symmetry: $P^T = P$

  3. Action: $Py = Xw^* = \hat{y}$ is the projection of $y$ onto the column space

Residual Projection Matrix: $M = I - P$. It satisfies $My = y - \hat{y}$ (the residual vector).

Properties:

- $M^2 = M$ (idempotent)
- $M^T = M$ (symmetric)
- $MX = 0$ (residuals orthogonal to the column space)
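All of these properties are easy to verify numerically. A small sketch (random data of my own construction; `solve` is used instead of an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 2))

# P projects onto Col(X); M projects onto its orthogonal complement
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(6) - P

y = rng.standard_normal(6)
y_hat = P @ y          # projection of y onto the column space
residual = M @ y       # residual component, orthogonal to Col(X)
```

Idempotency, symmetry, and the orthogonality of residuals to every column of $X$ all hold to machine precision.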

Geometric Intuition

In $m$-dimensional space:

- $y$ is the true output vector
- $\operatorname{Col}(X)$ is a $d$-dimensional subspace (if $X$ has full column rank)
- $\hat{y} = Py$ is the "shadow" of $y$ on $\operatorname{Col}(X)$
- $y - \hat{y}$ is the residual, perpendicular to $\operatorname{Col}(X)$

Analogy (projection onto a plane): Imagine, in 3D space, the perpendicular distance from a point $y$ to a 2D plane. The projection point $\hat{y}$ is the point of the plane that minimizes the distance.

Probabilistic Perspective: Maximum Likelihood Estimation

Linear Gaussian Model

Assume the data generation process is $y_i = w^T x_i + \epsilon_i$, where the $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed Gaussian noise.

Equivalent form: $p(y_i \mid x_i; w, \sigma^2) = \mathcal{N}(y_i \mid w^T x_i, \sigma^2)$. That is, given $x_i$, $y_i$ follows a Gaussian distribution with mean $w^T x_i$ and variance $\sigma^2$.

Likelihood Function

Given training data $D$, the likelihood function of the parameters $(w, \sigma^2)$ is: $L(w, \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right)$

Log-Likelihood

Take the logarithm (a monotonic transformation doesn't change the maximizer): $\ell(w, \sigma^2) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - w^T x_i)^2$

Maximum Likelihood Estimation

Optimization with respect to $w$:

Maximizing $\ell$ is equivalent to minimizing $\sum_{i=1}^{m}(y_i - w^T x_i)^2 = \|Xw - y\|^2$. This is precisely the least squares objective!

Theorem 3: Under the linear Gaussian model, maximum likelihood estimation is equivalent to least squares estimation: $w_{\text{MLE}} = w_{\text{LS}} = (X^T X)^{-1} X^T y$

Optimization with respect to $\sigma^2$:

Fixing $w = w_{\text{MLE}}$, differentiate $\ell$ with respect to $\sigma^2$ and set to zero: $\hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}(y_i - w^T x_i)^2$. That is, the maximum likelihood estimate of the noise variance is the mean of the residual sum of squares.
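This variance estimate can be sanity-checked on synthetic data (my own toy setup with known $\sigma = 2$; `np.linalg.lstsq` plays the role of the least squares / MLE fit):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
X = np.c_[rng.standard_normal((m, 2)), np.ones(m)]
w_true = np.array([1.0, -2.0, 0.5])
sigma = 2.0
y = X @ w_true + sigma * rng.standard_normal(m)

# MLE of w is the least squares fit; MLE of sigma^2 is the mean squared residual
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.mean((y - X @ w_mle) ** 2)
```

On a sample of this size, `sigma2_hat` lands close to the true $\sigma^2 = 4$ (the estimator is slightly biased downward, by a factor $(m-d)/m$, which is negligible here).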

Bayesian Perspective

Introduce a prior distribution $p(w)$ for the parameters, and compute the posterior through Bayes' rule: $p(w \mid D) \propto p(D \mid w)\, p(w)$

Gaussian Prior: Assume $w \sim \mathcal{N}(0, \tau^2 I)$. Maximizing the posterior probability (MAP) is then equivalent to minimizing: $\|Xw - y\|^2 + \frac{\sigma^2}{\tau^2}\|w\|^2$. This is precisely the objective function of Ridge Regression with $\lambda = \sigma^2/\tau^2$! The regularization term reflects the strength of the prior belief.

Regularization: Ridge and Lasso Regression

Ridge Regression (L2 Regularization)

Objective Function: $L_{\text{ridge}}(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2$, where $\lambda > 0$ is the regularization parameter.

Gradient: $\nabla_w L_{\text{ridge}} = X^T(Xw - y) + \lambda w$. Setting the gradient to zero: $(X^T X + \lambda I)w = X^T y$

Analytical Solution: $w^* = (X^T X + \lambda I)^{-1} X^T y$

Key Observations:

  • Adding $\lambda I$ ensures invertibility of $X^T X + \lambda I$ (even when $X^T X$ is not invertible)
  • When $\lambda \to 0$, the solution reduces to ordinary least squares
  • When $\lambda \to \infty$, $w^* \to 0$ (extreme regularization)

Matrix Perspective: Ridge regression "stabilizes" the matrix $X^T X$ by adding a diagonal term, avoiding ill-conditioned problems.
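A small NumPy sketch (toy collinear data of my own construction; $\lambda = 0.1$ is an arbitrary illustrative choice) shows how the $\lambda I$ term rescues an otherwise singular system:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
X = np.c_[x, 2.0 * x]      # perfectly collinear columns: X^T X is rank 1, singular
y = 3.0 * x                # target lies in the column space

lam = 0.1
G = X.T @ X
# The ridge system (G + lam*I) w = X^T y is always solvable
w_ridge = np.linalg.solve(G + lam * np.eye(2), X.T @ y)
```

Ordinary least squares has no unique solution here, but the ridge solution is well defined, finite, and (for small $\lambda$) its predictions $Xw$ almost reproduce $y$.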

Lasso Regression (L1 Regularization)

Objective Function: $L_{\text{lasso}}(w) = \frac{1}{2}\|Xw - y\|^2 + \lambda\|w\|_1$, where $\|w\|_1 = \sum_{j=1}^{d}|w_j|$ is the L1 norm.

Characteristics:

  • No analytical solution (L1 norm is non-differentiable)
  • Requires iterative algorithms (e.g., coordinate descent, proximal gradient)
  • Sparsity: Some parameters are exactly compressed to 0, achieving feature selection

Geometric Interpretation:

In the constrained form $\min_w \|Xw - y\|^2$ subject to $\|w\|_1 \le t$: the L1 constraint ball is diamond-shaped (a cross-polytope in high dimensions), whose sharp corners are more likely to intersect the loss contours on the coordinate axes, leading to sparse solutions.
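Although Lasso has no closed form, a minimal coordinate-descent sketch makes the sparsity concrete. This is not a production solver — just the standard soft-thresholding update for the objective $\frac{1}{2m}\|Xw - y\|^2 + \lambda\|w\|_1$, run on synthetic data of my own construction where three of five true coefficients are zero:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink toward 0, clip to 0 inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/2m)||Xw - y||^2 + lam * ||w||_1."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Partial residual excluding feature j's current contribution
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / m
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / m)
    return w

rng = np.random.default_rng(6)
m = 2000
X = rng.standard_normal((m, 5))
w_true = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ w_true + 0.05 * rng.standard_normal(m)

w_hat = lasso_cd(X, y, lam=0.3)
```

The irrelevant coefficients come out exactly zero (not merely small), while the active ones survive with a bias of roughly $\lambda$ toward zero — the characteristic shrink-and-select behavior of L1.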

Elastic Net

Combines L1 and L2: $L_{\text{EN}}(w) = \frac{1}{2}\|Xw - y\|^2 + \lambda_1\|w\|_1 + \frac{\lambda_2}{2}\|w\|^2$

Advantages:

  • Retains L1 sparsity
  • Retains L2 stability (friendly to collinear features)

Effect of Regularization

Bias-Variance Tradeoff:

  • No regularization ($\lambda = 0$): Low bias, high variance (overfitting)
  • Strong regularization (large $\lambda$): High bias, low variance (underfitting)
  • Optimal $\lambda$: Selected through cross-validation

Ridge Trace Plot:

Plot each coefficient $w_j(\lambda)$ versus $\lambda$ to observe the parameter shrinkage behavior.

Gradient Descent Algorithm

Batch Gradient Descent (BGD)

When the data is large, directly computing $(X^T X)^{-1} X^T y$ is too expensive ($O(md^2)$ to form $X^T X$ plus $O(d^3)$ to solve). Use iterative algorithms instead.

Algorithm:

  1. Initialize $w^{(0)}$
  2. For $t = 0, 1, 2, \ldots$: update $w^{(t+1)} = w^{(t)} - \alpha \nabla L(w^{(t)})$

Complete Form: $w^{(t+1)} = w^{(t)} - \dfrac{\alpha}{m} X^T \left(Xw^{(t)} - y\right)$

Computational Complexity: Each iteration costs $O(md)$ (matrix-vector multiplications).

Stochastic Gradient Descent (SGD)

Use only one sample to update parameters each time.

Algorithm:

  1. Initialize$w^{(0)}t = 0, 1, 2, (x_i, y_i)$ - Update:

Advantages:

  • Fast per iteration ($O(d)$)
  • Suitable for large-scale data
  • Can escape local minima (for non-convex problems)

Disadvantages:

  • Unstable convergence (high variance)
  • Requires careful learning rate tuning

Mini-Batch Gradient Descent

Compromise: Use $b$ samples each time (batch size $b$): $w^{(t+1)} = w^{(t)} - \dfrac{\alpha}{b} \sum_{i \in B_t} \left(w^{(t)T} x_i - y_i\right) x_i$, where $B_t$ is the batch for iteration $t$ (size $|B_t| = b$).

Typical Choice: a power of two such as $b \in \{32, 64, 128, 256\}$.

Convergence Analysis

Theorem 4 (BGD Convergence): For learning rate $\alpha \le 1/L$, batch gradient descent converges linearly to the optimal solution: $\|w^{(t)} - w^*\| \le (1 - \alpha\mu)^t \|w^{(0)} - w^*\|$, where $\mu = \lambda_{\min}(X^T X / m)$ and $L = \lambda_{\max}(X^T X / m)$ are the minimum and maximum eigenvalues of $X^T X / m$.

Proof Sketch:

  1. Loss function is strongly convex (positive definite Hessian)
  2. Gradient is Lipschitz continuous
  3. Apply convergence theorem for strongly convex functions
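The role of the eigenvalues can be checked numerically. The sketch below (synthetic data of my own construction) sets the step size to $\alpha = 1/L$ from the Hessian of the $\frac{1}{2m}$-scaled loss, runs batch gradient descent, and compares against the normal-equation solution:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 200, 5
X = rng.standard_normal((m, d))
y = X @ np.arange(1.0, d + 1) + 0.1 * rng.standard_normal(m)

H = X.T @ X / m                  # Hessian of the (1/2m)-scaled squared loss
L = np.linalg.eigvalsh(H).max()  # Lipschitz constant of the gradient
alpha = 1.0 / L                  # conservative step size from the theorem

w = np.zeros(d)
for _ in range(500):
    w -= alpha * (X.T @ (X @ w - y) / m)   # one batch gradient step

w_star = np.linalg.solve(X.T @ X, X.T @ y)  # exact minimizer for comparison
```

With this step size the iterates contract by a factor of about $1 - \mu/L$ per step, so 500 iterations are far more than enough to reach the normal-equation solution to high precision on this well-conditioned problem.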

Practical Recommendations:

  • Learning rate: a conservative choice such as $\alpha = 1/L$
  • Adaptive learning rates: Adam, RMSprop, etc.
  • Learning rate decay: e.g., $\alpha_t = \alpha_0 / (1 + kt)$

Model Evaluation and Selection

Evaluation Metrics

Mean Squared Error (MSE): $\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$

Root Mean Squared Error (RMSE): $\text{RMSE} = \sqrt{\text{MSE}}$

Mean Absolute Error (MAE): $\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$

Coefficient of Determination (R²): $R^2 = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of $y$.

Interpretation:

- $R^2 = 1$: Perfect fit
- $R^2 = 0$: Model equivalent to predicting the mean
- $R^2 < 0$: Model worse than the mean (rare, indicates model issues)

Adjusted R²

Considers a model complexity penalty: $\bar{R}^2 = 1 - (1 - R^2)\dfrac{m - 1}{m - d - 1}$

Advantage: When adding useless features, $\bar{R}^2$ decreases (while $R^2$ never decreases).
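Both metrics are one-liners in NumPy; here is a small sketch (function names are my own, and `d` counts features excluding the intercept):

```python
import numpy as np

def r2_score(y, y_pred):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r2(y, y_pred, d):
    """Adjusted R^2 with d features (intercept not counted) and m = len(y) samples."""
    m = len(y)
    return 1 - (1 - r2_score(y, y_pred)) * (m - 1) / (m - d - 1)
```

For a perfect fit both return 1; otherwise the adjustment pulls $R^2$ down by the factor $(m-1)/(m-d-1) > 1$ applied to the unexplained fraction.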

Cross-Validation

k-Fold Cross-Validation:

  1. Split the data into $k$ folds
  2. For $i = 1, \ldots, k$:
    • Train on all data except fold $i$
    • Test on fold $i$
  3. Average the $k$ test errors

Python Implementation:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"RMSE: {rmse_scores.mean():.4f} ± {rmse_scores.std():.4f}")

Complete Code Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LinearRegression:
    """
    Complete implementation of linear regression

    Supports:
    - Analytical solution (normal equation)
    - Batch gradient descent
    - Stochastic gradient descent
    - Ridge regularization
    """

    def __init__(self, method='normal', alpha=0.01, n_iterations=1000,
                 lambda_reg=0.0, batch_size=None, random_state=42):
        """
        Parameters:
            method: str, solving method
                'normal': Normal equation
                'bgd': Batch gradient descent
                'sgd': Stochastic gradient descent
                'mini_batch': Mini-batch gradient descent
            alpha: float, learning rate (for gradient descent only)
            n_iterations: int, number of iterations
            lambda_reg: float, Ridge regularization parameter
            batch_size: int, batch size (for mini_batch only)
            random_state: int, random seed
        """
        self.method = method
        self.alpha = alpha
        self.n_iterations = n_iterations
        self.lambda_reg = lambda_reg
        self.batch_size = batch_size
        self.random_state = random_state
        self.w = None
        self.loss_history = []

    def fit(self, X, y):
        """
        Train linear regression model

        Parameters:
            X: np.array, shape=(m, d), input features
            y: np.array, shape=(m,), output labels
        """
        # Add bias term
        X_bias = self._add_bias(X)
        m, d = X_bias.shape

        # Initialize weights
        np.random.seed(self.random_state)
        self.w = np.random.randn(d) * 0.01

        if self.method == 'normal':
            # Normal equation
            self.w = self._normal_equation(X_bias, y)
        elif self.method == 'bgd':
            # Batch gradient descent
            self._batch_gradient_descent(X_bias, y)
        elif self.method == 'sgd':
            # Stochastic gradient descent
            self._stochastic_gradient_descent(X_bias, y)
        elif self.method == 'mini_batch':
            # Mini-batch gradient descent
            if self.batch_size is None:
                self.batch_size = min(32, m)
            self._mini_batch_gradient_descent(X_bias, y)
        else:
            raise ValueError(f"Unknown method: {self.method}")

        return self

    def predict(self, X):
        """
        Prediction

        Parameters:
            X: np.array, shape=(m, d)

        Returns:
            y_pred: np.array, shape=(m,)
        """
        X_bias = self._add_bias(X)
        return X_bias @ self.w

    def score(self, X, y):
        """
        Compute R² score
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    def _add_bias(self, X):
        """Add bias column"""
        return np.c_[X, np.ones(X.shape[0])]

    def _compute_loss(self, X, y):
        """Compute loss (including regularization)"""
        predictions = X @ self.w
        mse = np.mean((predictions - y) ** 2)
        reg = self.lambda_reg * np.sum(self.w[:-1] ** 2)  # Don't regularize bias
        return mse + reg

    def _compute_gradient(self, X, y):
        """Compute gradient"""
        m = len(y)
        predictions = X @ self.w
        gradient = X.T @ (predictions - y) / m
        # Add regularization gradient (excluding bias)
        if self.lambda_reg > 0:
            reg_gradient = np.zeros_like(self.w)
            reg_gradient[:-1] = 2 * self.lambda_reg * self.w[:-1]
            gradient += reg_gradient
        return gradient

    def _normal_equation(self, X, y):
        """Normal equation solver"""
        # w = (X^T X + lambda*I)^{-1} X^T y
        d = X.shape[1]
        reg_matrix = self.lambda_reg * np.eye(d)
        reg_matrix[-1, -1] = 0  # Don't regularize bias

        try:
            w = np.linalg.solve(X.T @ X + reg_matrix, X.T @ y)
        except np.linalg.LinAlgError:
            # If the matrix is singular, use the pseudoinverse
            w = np.linalg.pinv(X.T @ X + reg_matrix) @ X.T @ y

        return w

    def _batch_gradient_descent(self, X, y):
        """Batch gradient descent"""
        for i in range(self.n_iterations):
            gradient = self._compute_gradient(X, y)
            self.w -= self.alpha * gradient

            # Record loss every 10 iterations
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _stochastic_gradient_descent(self, X, y):
        """Stochastic gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)

        for i in range(self.n_iterations):
            # Randomly select one sample
            idx = np.random.randint(m)
            X_i = X[idx:idx+1]
            y_i = y[idx:idx+1]

            gradient = self._compute_gradient(X_i, y_i)
            self.w -= self.alpha * gradient

            # Record loss every 10 iterations
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _mini_batch_gradient_descent(self, X, y):
        """Mini-batch gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)

        for i in range(self.n_iterations):
            # Randomly select a batch
            indices = np.random.choice(m, self.batch_size, replace=False)
            X_batch = X[indices]
            y_batch = y[indices]

            gradient = self._compute_gradient(X_batch, y_batch)
            self.w -= self.alpha * gradient

            # Record loss every 10 iterations
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))


# Example: House price prediction
def demo_linear_regression():
    """
    Complete example: Linear regression on housing price data
    """
    # Generate synthetic data (simulating house price prediction)
    np.random.seed(42)
    m = 500
    d = 5

    # True weights (last one is the bias)
    w_true = np.array([50, -20, 30, 15, -10, 200])

    # Generate features
    X = np.random.randn(m, d)

    # Add bias column and compute true values
    X_bias = np.c_[X, np.ones(m)]
    y_true = X_bias @ w_true

    # Add noise
    noise = np.random.randn(m) * 20
    y = y_true + noise

    # Split train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Compare different methods
    methods = {
        'Normal Equation': LinearRegression(method='normal'),
        'Batch GD': LinearRegression(method='bgd', alpha=0.1, n_iterations=1000),
        'SGD': LinearRegression(method='sgd', alpha=0.01, n_iterations=2000),
        'Mini-batch GD': LinearRegression(method='mini_batch', alpha=0.05,
                                          n_iterations=1000, batch_size=32)
    }

    results = {}

    print("=" * 70)
    print("Comparison of Linear Regression Methods")
    print("=" * 70)

    for name, model in methods.items():
        # Train
        model.fit(X_train_scaled, y_train)

        # Evaluate
        train_score = model.score(X_train_scaled, y_train)
        test_score = model.score(X_test_scaled, y_test)
        y_pred = model.predict(X_test_scaled)
        rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

        results[name] = {
            'train_r2': train_score,
            'test_r2': test_score,
            'rmse': rmse,
            'weights': model.w
        }

        print(f"\n{name}:")
        print(f"  Train R²: {train_score:.4f}")
        print(f"  Test R²: {test_score:.4f}")
        print(f"  Test RMSE: {rmse:.2f}")

    # Visualize: loss curves and predictions
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Loss curves
    ax = axes[0]
    for name in ['Batch GD', 'SGD', 'Mini-batch GD']:
        model = methods[name]
        if len(model.loss_history) > 0:
            ax.plot(model.loss_history, label=name, alpha=0.7)
    ax.set_xlabel('Iteration (×10)')
    ax.set_ylabel('Loss')
    ax.set_title('Convergence of Gradient Descent Methods')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Predictions vs true values
    ax = axes[1]
    model = methods['Normal Equation']
    y_pred = model.predict(X_test_scaled)
    ax.scatter(y_test, y_pred, alpha=0.5)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
            'r--', lw=2, label='Perfect Prediction')
    ax.set_xlabel('True Values')
    ax.set_ylabel('Predictions')
    ax.set_title('Predictions vs True Values')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('linear_regression_demo.png', dpi=150)
    plt.show()

    # Ridge regression: regularization effect
    print("\n" + "=" * 70)
    print("Ridge Regression: Regularization Parameter Selection")
    print("=" * 70)

    lambdas = [0, 0.01, 0.1, 1, 10, 100]
    train_scores = []
    test_scores = []

    for lam in lambdas:
        model = LinearRegression(method='normal', lambda_reg=lam)
        model.fit(X_train_scaled, y_train)
        train_scores.append(model.score(X_train_scaled, y_train))
        test_scores.append(model.score(X_test_scaled, y_test))
        print(f"λ={lam:6.2f}: Train R²={train_scores[-1]:.4f}, "
              f"Test R²={test_scores[-1]:.4f}")

    # Visualize regularization effect
    plt.figure(figsize=(10, 6))
    plt.plot(lambdas, train_scores, 'o-', label='Training R²', linewidth=2)
    plt.plot(lambdas, test_scores, 's-', label='Testing R²', linewidth=2)
    plt.xscale('log')
    plt.xlabel('Regularization Parameter λ')
    plt.ylabel('R² Score')
    plt.title('Effect of Ridge Regularization')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('ridge_regularization.png', dpi=150)
    plt.show()


if __name__ == "__main__":
    demo_linear_regression()

Sample Output:

======================================================================
Comparison of Linear Regression Methods
======================================================================

Normal Equation:
  Train R²: 0.9452
  Test R²: 0.9387
  Test RMSE: 19.87

Batch GD:
  Train R²: 0.9452
  Test R²: 0.9387
  Test RMSE: 19.87

SGD:
  Train R²: 0.9449
  Test R²: 0.9385
  Test RMSE: 19.91

Mini-batch GD:
  Train R²: 0.9451
  Test R²: 0.9386
  Test RMSE: 19.88

======================================================================
Ridge Regression: Regularization Parameter Selection
======================================================================
λ=  0.00: Train R²=0.9452, Test R²=0.9387
λ=  0.01: Train R²=0.9452, Test R²=0.9387
λ=  0.10: Train R²=0.9451, Test R²=0.9388
λ=  1.00: Train R²=0.9442, Test R²=0.9391
λ= 10.00: Train R²=0.9361, Test R²=0.9372
λ=100.00: Train R²=0.8647, Test R²=0.8723

Q&A: Core Questions on Linear Regression

Q1: Why use squared loss instead of absolute loss?

Mathematical Reasons:

  1. Differentiability: $(y - \hat{y})^2$ is differentiable everywhere, convenient for optimization
  2. Analytical Solution: Squared loss leads to a quadratic optimization problem with a closed-form solution
  3. Statistical Meaning: Under the Gaussian noise assumption, squared loss corresponds to maximum likelihood estimation

Absolute Loss (L1 loss) $|y - \hat{y}|$:
- Advantage: Robust to outliers
- Disadvantage: Non-differentiable at 0, no analytical solution, requires linear programming

Comparison:

| Loss Function | Differentiability | Analytical Solution | Outlier Robustness | Implied Noise Distribution |
|---|---|---|---|---|
| Squared (L2) | Everywhere | Yes | Weak | Gaussian |
| Absolute (L1) | Not at 0 | No | Strong | Laplace |
| Huber | Everywhere | No | Medium | Mixed |

Huber Loss (Compromise): $L_\delta(r) = \frac{1}{2}r^2$ when $|r| \le \delta$, and $\delta\left(|r| - \frac{1}{2}\delta\right)$ otherwise. It is quadratic for small residuals and linear for large ones, combining the advantages of both.
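The piecewise definition translates directly to NumPy (a vectorized sketch, with the threshold $\delta$ defaulting to 1):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond (continuous at the join)."""
    abs_r = np.abs(r)
    quadratic = 0.5 * r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)
```

At $r = \delta$ both branches give $\frac{1}{2}\delta^2$, so the loss and its derivative are continuous at the junction.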


Q2: Normal equation vs Gradient descent — when to use which?

Decision Tree:

Data scale?
├─ Small (m < 10000, d < 1000)
│   └─ Use normal equation (fast, accurate)
└─ Large (m > 10000 or d > 1000)
    ├─ Is X^T X invertible?
    │   ├─ Yes → Gradient descent
    │   └─ No → Ridge regression or gradient descent
    └─ Enough memory?
        ├─ Yes → Batch gradient descent
        └─ No → SGD or Mini-batch GD

Detailed Comparison:

| Dimension | Normal Equation | Gradient Descent |
|---|---|---|
| Time Complexity | $O(md^2 + d^3)$ | $O(md)$ per iteration |
| Space Complexity | $O(d^2)$ | $O(d)$ |
| Convergence | One step | Multiple iterations |
| Hyperparameters | None | Learning rate, iterations |
| Feature Count | Feasible for moderate $d$ | Arbitrarily large |
| Invertibility | Requires $X^T X$ invertible | No requirement |
| Regularization | Easy to add | Easy to add |
| Online Learning | Not supported | Supported (SGD) |

Practical Recommendations:

  • Default: for small problems use the normal equation, otherwise gradient descent
  • Big Data: use SGD or Mini-batch GD
  • Real-time Updates: use SGD (supports online learning)

Q3: Why does Ridge regression always have a solution?

Core Reason: Adding $\lambda I$ makes the matrix positive definite.

Theorem: For any $\lambda > 0$, the matrix $X^T X + \lambda I$ is positive definite, thus invertible.

Proof:

For any non-zero vector $v$: $v^T(X^T X + \lambda I)v = \|Xv\|^2 + \lambda\|v\|^2$. Since $\|Xv\|^2 \ge 0$ and $\lambda\|v\|^2 > 0$, the sum is strictly positive — even if $Xv = 0$ (i.e., $v$ lies in the null space of $X$). Therefore $X^T X + \lambda I$ is positive definite, hence invertible.

Intuition:

- $X^T X$ might be singular (e.g., collinear features)
- Adding $\lambda I$ is like adding a "safety cushion" on the diagonal
- Even if some features are completely correlated, $\lambda I$ ensures positive definiteness

Geometric Meaning:

In feature space, the null space of $X$ corresponds to directions that cannot be determined from the data. Ridge regression applies a "preference for 0" prior in these directions by adding $\lambda\|w\|^2$, making the problem well-posed.
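The eigenvalue shift behind the proof is easy to see numerically. In this sketch (toy collinear columns and $\lambda = 0.01$, both arbitrary choices of mine), the Gram matrix has a zero eigenvalue, and adding $\lambda I$ lifts every eigenvalue by exactly $\lambda$:

```python
import numpy as np

x = np.arange(5.0)
X = np.c_[x, 3.0 * x]          # perfectly collinear columns
G = X.T @ X                     # singular Gram matrix: smallest eigenvalue is 0

lam = 0.01
eigs_ridge = np.linalg.eigvalsh(G + lam * np.eye(2))  # eigenvalues shifted up by lam
```

Since $\operatorname{eig}(G + \lambda I) = \operatorname{eig}(G) + \lambda$, the smallest eigenvalue of the ridge matrix equals $\lambda > 0$, which is precisely the positive-definiteness claim.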


Q4: How to choose regularization parameter?

Method 1: Cross-Validation (Recommended)

import numpy as np
from sklearn.linear_model import RidgeCV

# Automatically select the optimal lambda
lambdas = np.logspace(-3, 3, 50)
model = RidgeCV(alphas=lambdas, cv=5)
model.fit(X_train, y_train)
print(f"Optimal λ: {model.alpha_:.4f}")

Method 2: Grid Search

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

param_grid = {'alpha': np.logspace(-3, 3, 20)}
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(f"Optimal λ: {grid.best_params_['alpha']:.4f}")

Method 3: Theoretical Guidance

Based on the bias-variance tradeoff, theoretical heuristics relate the optimal $\lambda$ to the noise variance $\sigma^2$ and the magnitude of the true parameter $w^*$: more noise favors a larger $\lambda$, a larger true signal favors a smaller one.

Practical Experience:

  • Starting Point: a moderate value such as $\lambda = 1$ on standardized data
  • Range: $\lambda \in [10^{-3}, 10^{3}]$, searched on a logarithmic scale
  • Fine-tuning: Narrow the range around the best value found
  • Early Stopping: Stop if the validation error increases continuously

L-Curve Method:

Plot $\|w_\lambda\|$ versus $\|Xw_\lambda - y\|$ as $\lambda$ varies, and select the $\lambda$ at the "elbow" of the curve.


Q5: Why is feature standardization important?

Problem: Different feature scales lead to:

  1. Slow Gradient Descent Convergence: The loss surface becomes elongated (ill-conditioned), so the negative gradient doesn't point toward the optimum
  2. Unfair Regularization: The penalty $\lambda\|w\|^2$ treats all weights equally, but weights on differently scaled features are not comparable

Example:

Suppose two features:

- $x_1$: Area (range 0–1000 square meters)
- $x_2$: Number of rooms (range 1–10)

Weights of, say, $w_1 = 0.1$ and $w_2 = 10$ may have comparable impact on the prediction, yet the penalty $\lambda\|w\|^2$ falls almost entirely on $w_2$.

Standardization Methods:

Z-score Standardization (Recommended): $x' = \dfrac{x - \mu}{\sigma}$, where $\mu$ is the feature mean and $\sigma$ its standard deviation.

Min-Max Standardization: $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$

Effect Comparison:

| Method | Mean | Std/Range | Outlier Sensitivity |
|---|---|---|---|
| Raw Data | Arbitrary | Arbitrary | - |
| Z-score | 0 | 1 | Medium |
| Min-Max | - | [0, 1] | High |

Code:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use train mean and std

Note: After standardization, the interpretation of the weights changes. To recover original-scale weights: $w_j^{\text{orig}} = w_j^{\text{std}} / \sigma_j$, with the intercept adjusted by $b^{\text{orig}} = b^{\text{std}} - \sum_j w_j^{\text{std}} \mu_j / \sigma_j$.
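The back-transformation can be sketched as follows (synthetic two-feature data of my own construction, noiseless so the recovery is exact; substituting $x_j = \sigma_j x'_j + \mu_j$ into the standardized model gives the formulas used below):

```python
import numpy as np

rng = np.random.default_rng(4)
# Features on very different scales, target exactly linear: y = 0.2*x1 + 5*x2 + 1
X = rng.standard_normal((100, 2)) * np.array([100.0, 1.0]) + np.array([500.0, 3.0])
y = X @ np.array([0.2, 5.0]) + 1.0

mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

# Fit on standardized features (with an explicit intercept column)
A = np.c_[Xs, np.ones(len(Xs))]
w_s = np.linalg.lstsq(A, y, rcond=None)[0]

# Map standardized-scale weights back to the original scale
w_orig = w_s[:-1] / sigma
b_orig = w_s[-1] - np.sum(w_s[:-1] * mu / sigma)
```

The recovered `w_orig` and `b_orig` match the data-generating coefficients, even though the standardized weights themselves look completely different.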

Q6: How does multicollinearity affect linear regression?

Definition: Multicollinearity refers to high linear correlation among features.

Detection Methods:

Variance Inflation Factor (VIF): $\text{VIF}_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination when predicting feature $j$ from the other features.

Judgment Criteria:

- $\text{VIF} < 5$: No serious collinearity
- $5 \le \text{VIF} < 10$: Moderate collinearity
- $\text{VIF} \ge 10$: Severe collinearity

Impact:

  1. Numerical Instability: $X^T X$ is nearly singular, so computing $(X^T X)^{-1}$ incurs large numerical error

  2. Large Parameter Variance: Standard error of weight estimates increases, confidence intervals widen

  3. Uninterpretable Parameters: Weight signs may contradict expectations

Example:

Suppose $x_2 \approx x_1$ (nearly collinear). Then:

  • True model: $y = x_1 + x_2$
  • The estimated model might return something like $y = 100 x_1 - 98 x_2$: huge, unstable coefficients whose contributions nearly cancel
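This instability is easy to reproduce. The sketch below (synthetic data; the $10^{-6}$ perturbation and $\lambda = 1$ are arbitrary illustrative choices of mine) fits OLS and ridge on two nearly identical features. On this draw, OLS returns huge offsetting coefficients whose sum still matches the true total effect of about 2, while ridge keeps both coefficients near 1:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200
x1 = rng.standard_normal(m)
x2 = x1 + 1e-6 * rng.standard_normal(m)   # x2 ≈ x1: near-perfect collinearity
y = x1 + x2 + 0.1 * rng.standard_normal(m)

X = np.c_[x1, x2]
# OLS: individual coefficients are wildly unstable along the (1, -1) direction
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: the lambda*I term pins down the undetermined direction
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The well-determined direction (the sum of the coefficients) is estimated accurately either way; collinearity only destroys the split between the two individual weights, which is exactly what ridge repairs.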

Solutions:

  1. Remove Redundant Features: Identify and remove via VIF
  2. PCA Dimensionality Reduction: Transform correlated features into orthogonal principal components
  3. Ridge Regression: the $\lambda\|w\|^2$ penalty mitigates collinearity
  4. Collect More Data: Increase sample size to reduce estimation variance

Python VIF Detection:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# Compute VIF for each feature
vif_data = []
for i in range(X.shape[1]):
    vif = variance_inflation_factor(X, i)
    vif_data.append({'Feature': f'X{i+1}', 'VIF': vif})

df_vif = pd.DataFrame(vif_data)
print(df_vif)

# Flag features with VIF > 10
high_vif = df_vif[df_vif['VIF'] > 10]['Feature'].tolist()
print(f"High VIF features (consider removal): {high_vif}")

Q7: What are the assumptions of linear regression? What if violated?

Four Key Assumptions of Classical Linear Regression:

Assumption 1: Linearity — $E[y \mid x] = w^T x$.

Check: The residual plot should show no obvious pattern.

When Violated:

  • Add polynomial features: $x^2, x^3, \ldots$
  • Feature transformations: $\log x$, $\sqrt{x}$, $1/x$, etc.
  • Use nonlinear models: decision trees, neural networks

Assumption 2: Independence — the errors $\epsilon_i$ are mutually independent.

Check: Durbin-Watson test; a statistic $\text{DW} \approx 2$ indicates no autocorrelation.

When Violated (common in time series):

  • Use autoregressive models (AR, ARIMA)
  • Add lagged terms as features

Assumption 3: Homoscedasticity — $\text{Var}(\epsilon_i) = \sigma^2$ (constant).

Check: In the residual plot, the residual spread should remain constant across fitted values.

When Violated (heteroscedasticity):

  • Weighted Least Squares (WLS): different weights for different samples
  • Robust Standard Errors
  • Log-transform target variable

Assumption 4: Normality — $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Check:

  • QQ plot (Quantile-Quantile Plot)
  • Shapiro-Wilk test

When Violated:

  • Transform target variable (Box-Cox transformation)
  • Use non-parametric methods or generalized linear models

Diagnostic Code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Fit model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
residuals = y_train - y_pred

# Residual plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Residuals vs Fitted
axes[0, 0].scatter(y_pred, residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# 2. QQ Plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')

# 3. Scale-Location Plot (check homoscedasticity)
standardized_residuals = residuals / np.std(residuals)
axes[1, 0].scatter(y_pred, np.sqrt(np.abs(standardized_residuals)), alpha=0.5)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot')

# 4. Residual Histogram
axes[1, 1].hist(residuals, bins=30, edgecolor='black')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Histogram of Residuals')

plt.tight_layout()
plt.show()

Q8: How to handle categorical features?

Problem: Linear regression requires numerical inputs, but many features are categorical (e.g., "color": red, green, blue).

Wrong Approach: Directly encode as integers (red=1, green=2, blue=3).

Issue: Introduces false ordinal relationships (model thinks blue is "2 times larger" than red).

Correct Method:

One-Hot Encoding:

Convert a $k$-category feature into $k$ binary features.

Example:

| Original | Red | Green | Blue |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
| Blue | 0 | 0 | 1 |

Code:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Method 1: pandas
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
encoded = pd.get_dummies(data['color'], prefix='color')
print(encoded)

# Method 2: sklearn (older sklearn versions use sparse=False instead of sparse_output)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
X_encoded = encoder.fit_transform(data[['color']])

Notes:

  • Multicollinearity: the $k$ one-hot columns are perfectly linearly dependent with the intercept (they sum to 1). Solution: drop one column (drop='first') or use Ridge regression.
  • High Cardinality Features: If the category count $k$ is large (e.g., ZIP codes), consider:
    • Target Encoding
    • Embeddings

Q9: Can linear regression handle nonlinear relationships?

Answer: Yes, through feature engineering.

Key Insight: "Linear" in linear regression refers to linearity in the parameters, not in the features. A model such as $y = w_1 x + w_2 x^2 + w_3 \log x + b$ is still linear in $w$, and is therefore solvable by linear regression on transformed features.

Method 1: Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Generate polynomial features (degree 3)
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Example: [x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3]

Method 2: Interaction Features

from sklearn.preprocessing import PolynomialFeatures

# Generate only interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)

# Example: [x1, x2, x3] -> [x1, x2, x3, x1*x2, x1*x3, x2*x3]

Method 3: Custom Transformations

import numpy as np

# Add log, exp, etc. transformations
X_transformed = np.c_[
    X,
    np.log(X + 1),   # Log
    np.sqrt(X),      # Square root
    X ** 2,          # Square
    np.sin(X)        # Trigonometric
]

Cautions:

  • Overfitting Risk: The feature count grows rapidly (degree-$p$ polynomial features in $d$ variables give $\binom{d+p}{p}$ terms)
  • Regularization Necessity: Use Ridge or Lasso to control complexity
  • Interpretability Decline: High-degree terms are hard to interpret

Example: Fitting Sine Curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Generate data
X = np.linspace(0, 4*np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(100) * 0.1

# Polynomial fitting with Ridge regularization
model = Pipeline([
    ('poly', PolynomialFeatures(degree=9)),
    ('ridge', Ridge(alpha=0.1))
])

model.fit(X, y)
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X, y_pred, 'r-', linewidth=2, label='Polynomial Fit')
plt.plot(X, np.sin(X), 'g--', label='True Function')
plt.legend()
plt.show()

Q10: How to interpret linear regression coefficients?

Basic Interpretation:

For the model $y = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$:

Meaning of $w_j$: With the other features held constant, increasing $x_j$ by 1 unit changes $y$ by $w_j$ units on average.

Cautions:

  1. Scale Dependence: Unless the features are standardized, weight magnitudes cannot be compared directly as measures of importance.

  2. Collinearity Impact: Multicollinearity causes unstable weights.

  3. Causality: Correlation ≠ causation. Weights only show statistical association, cannot infer causality.

Standardized Coefficients:

Fit the model on standardized data; the resulting coefficients can then be compared for relative importance.

Interpretation: increasing $x_j$ by one standard deviation changes the prediction by $w_j^{\text{std}}$ units.

Significance Testing:

Use a t-test to determine whether $w_j$ is significantly different from 0: $t_j = \hat{w}_j / \text{SE}(\hat{w}_j)$, where $\text{SE}(\hat{w}_j)$ is the standard error. If $|t_j|$ is large (p-value small), reject the null hypothesis $w_j = 0$.

Python Implementation:

import statsmodels.api as sm

# Add constant term
X_with_const = sm.add_constant(X)

# Fit model
model = sm.OLS(y, X_with_const).fit()

# Print summary
print(model.summary())

# Includes:
# - Coefficient estimates
# - Standard errors
# - t-statistics
# - p-values
# - Confidence intervals

Confidence Interval:

The 95% confidence interval (large-sample approximation): $\hat{w}_j \pm 1.96 \cdot \text{SE}(\hat{w}_j)$. If the interval doesn't include 0, the parameter is significant.

Summary and Outlook

Core Points:

  1. Three Perspectives: the algebraic (normal equation), geometric (projection), and probabilistic (MLE) derivations converge to the same solution

  2. Least Squares Solution: $w^* = (X^T X)^{-1} X^T y$

  3. Regularization: Ridge (L2) ensures invertibility; Lasso (L1) achieves sparsity

  4. Optimization Algorithms: use the normal equation for small data, gradient descent for big data

  5. Model Diagnostics: check linearity, independence, homoscedasticity, normality

Practical Points:

  • Feature standardization
  • Handle collinearity
  • Select regularization parameter
  • Cross-validation evaluation

Next Chapter Preview: Chapter 6 will explore logistic regression and classification problems, extending linear models to discrete output spaces. We will derive the origin of the sigmoid function, mathematical foundation of cross-entropy loss, and geometric meaning of decision boundaries.

References

  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.

  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

  3. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

  4. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

  5. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  6. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.

  7. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.

  8. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

  9. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.

  10. Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73-101.

  11. Li, H. (2012). Statistical Learning Methods. Tsinghua University Press.

  12. Zhou, Z. H. (2016). Machine Learning. Tsinghua University Press.

  • Post title:Mathematical Derivation of Machine Learning (5): Linear Regression
  • Post author:Chen Kai
  • Create time:2025-10-25 00:00:00
  • Post link:https://www.chenk.top/ml-math-05-linear-regression/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.