Mathematical Derivations in Machine Learning (6): Logistic Regression and Classification
Chen Kai

The leap from linear regression to logistic regression marks a crucial transition in machine learning from regression tasks to classification tasks. Despite its name containing "regression", logistic regression is actually a foundational classification algorithm that establishes a bridge between linear models and probabilistic predictions through the Sigmoid function. This chapter will deeply derive the mathematical essence of logistic regression: from the construction of likelihood functions to the details of gradient computation, from binary classification to multiclass generalization, from optimization algorithms to regularization techniques, comprehensively revealing the probabilistic modeling philosophy for classification problems.

From Linear Models to Probabilistic Classification

Limitations of Linear Classification

Recall that in linear regression, we established a linear mapping between inputs and continuous outputs:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

However, classification task labels are discrete (e.g., $y \in \{0, 1\}$), and directly using a linear model has two problems:

  1. Unconstrained output: $\mathbf{w}^\top \mathbf{x} + b$ can be any real number, but class labels need to be in a finite set
  2. Missing probabilistic interpretation: linear models cannot provide "the probability that a sample belongs to a class"

Logistic regression solves this contradiction by introducing a link function: mapping the output of the linear model to the $(0, 1)$ interval, endowing it with probabilistic meaning.

Sigmoid Function: From Real Numbers to Probabilities

The Sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

It has elegant mathematical properties:

Property 1: Range Constraint. For any $z \in \mathbb{R}$, we have $0 < \sigma(z) < 1$, satisfying the value requirements for probabilities.

Property 2: Symmetry. $\sigma(-z) = 1 - \sigma(z)$

Proof:

$$\sigma(-z) = \frac{1}{1 + e^{z}} = \frac{e^{-z}}{e^{-z} + 1} = 1 - \frac{1}{1 + e^{-z}} = 1 - \sigma(z)$$

Property 3: Derivative Self-Representation. $\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$

Proof:

$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$

This property is key to the simplicity of gradient computation.
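These properties are easy to confirm numerically. A minimal NumPy sanity check (the `sigmoid` helper here is mine, not the post's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Property 2: sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# Property 3: sigma'(z) = sigma(z) * (1 - sigma(z)), checked via central differences
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
assert np.allclose(numeric, analytic, atol=1e-8)
```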

Definition of Logistic Regression Model

For a binary classification task ($y \in \{0, 1\}$), define:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}$$

Correspondingly:

$$P(y = 0 \mid \mathbf{x}) = 1 - \sigma(\mathbf{w}^\top \mathbf{x})$$

Unified Representation: Using exponential form, the two probabilities can be combined as:

$$P(y \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})^{y} \bigl(1 - \sigma(\mathbf{w}^\top \mathbf{x})\bigr)^{1 - y}$$

When $y = 1$, the formula reduces to $\sigma(\mathbf{w}^\top \mathbf{x})$; when $y = 0$, it reduces to $1 - \sigma(\mathbf{w}^\top \mathbf{x})$.

Maximum Likelihood Estimation and Loss Function

Construction of Likelihood Function

Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, assuming samples are independent and identically distributed (i.i.d.), the likelihood function is:

$$L(\mathbf{w}) = \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i)$$

Substituting the logistic regression model:

$$L(\mathbf{w}) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}, \quad \hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$$

Log-Likelihood and Cross-Entropy

Taking the logarithm gives the log-likelihood:

$$\ell(\mathbf{w}) = \sum_{i=1}^{N} \bigl[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \bigr]$$

Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood:

$$\mathcal{L}(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \bigl[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \bigr]$$

where $\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$ is the predicted probability. This is precisely the Binary Cross-Entropy Loss.

Information-Theoretic Interpretation: Cross-entropy measures the difference between the true distribution $p$ and the predicted distribution $q$:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

In binary classification, the true distribution $p$ corresponds to hard labels $y_i \in \{0, 1\}$, and the predicted distribution $q$ corresponds to $\hat{y}_i$. Substituting gives the above loss.

Comparison with Mean Squared Error

If using Mean Squared Error (MSE) as the loss:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{2} (\hat{y} - y)^2$$

Computing the gradient:

$$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial \mathbf{w}} = (\hat{y} - y)\, \hat{y} (1 - \hat{y})\, \mathbf{x}$$

Note the additional term $\hat{y}(1 - \hat{y})$. When $\hat{y}$ is close to 0 or 1 (i.e., the model is very confident), this term approaches 0, causing vanishing gradients, with almost no updates even when predictions are incorrect.

The gradient of cross-entropy loss is (derived in the next section):

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = (\hat{y} - y)\, \mathbf{x}$$

Without the $\hat{y}(1 - \hat{y})$ term, it avoids the vanishing gradient problem.
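The contrast is easy to see numerically. In this sketch (the sample values are chosen purely for illustration), a confidently wrong prediction yields a near-zero MSE gradient but a cross-entropy gradient near 1 in magnitude:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong prediction: true label y=1, but z is very negative
x, y = 1.0, 1.0
z = -8.0
y_hat = sigmoid(z)                                  # tiny, ~3e-4

grad_mse = (y_hat - y) * y_hat * (1 - y_hat) * x    # MSE: extra y_hat(1-y_hat) factor
grad_ce = (y_hat - y) * x                           # cross-entropy

print(f"MSE gradient: {grad_mse:.2e}")              # vanishes (magnitude ~3e-4)
print(f"CE  gradient: {grad_ce:.2e}")               # stays near -1

assert abs(grad_ce) > 1000 * abs(grad_mse)
```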

Gradient Derivation and Optimization Algorithms

Precise Gradient Computation

For the loss of a single sample:

$$\mathcal{L}_i = -\bigl[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \bigr]$$

where $\hat{y}_i = \sigma(z_i)$, $z_i = \mathbf{w}^\top \mathbf{x}_i$. Using the chain rule:

Step 1:

$$\frac{\partial \mathcal{L}_i}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i}$$

Step 2: Using the Sigmoid derivative property,

$$\frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i (1 - \hat{y}_i)$$

Step 3:

$$\frac{\partial z_i}{\partial \mathbf{w}} = \mathbf{x}_i$$

Combining:

$$\frac{\partial \mathcal{L}_i}{\partial \mathbf{w}} = \left( -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i} \right) \hat{y}_i (1 - \hat{y}_i)\, \mathbf{x}_i = (\hat{y}_i - y_i)\, \mathbf{x}_i$$

The total gradient is:

$$\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{N} X^\top (\hat{\mathbf{y}} - \mathbf{y})$$

where $X \in \mathbb{R}^{N \times d}$ is the data matrix, and $\hat{\mathbf{y}}, \mathbf{y} \in \mathbb{R}^{N}$ are the predicted and true label vectors.
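The closed-form gradient can be validated against central finite differences; a quick check on random data (the helper names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    y_hat = sigmoid(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

# Compare the analytic gradient with central finite differences
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(grad(w, X, y), numeric, atol=1e-7)
```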

Hessian Matrix and Second-Order Methods

For Newton's method and other second-order optimization, we need the Hessian matrix. Taking the derivative of the single-sample gradient $(\hat{y}_i - y_i)\, \mathbf{x}_i$ again:

$$\frac{\partial}{\partial \mathbf{w}} (\hat{y}_i - y_i)\, \mathbf{x}_i = \hat{y}_i (1 - \hat{y}_i)\, \mathbf{x}_i \mathbf{x}_i^\top$$

The total Hessian is:

$$H = \frac{1}{N} X^\top S X$$

where $S = \mathrm{diag}\bigl(\hat{y}_1 (1 - \hat{y}_1), \ldots, \hat{y}_N (1 - \hat{y}_N)\bigr)$.

Positive Definiteness Analysis: For any $\mathbf{v} \neq \mathbf{0}$,

$$\mathbf{v}^\top H \mathbf{v} = \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i (1 - \hat{y}_i) (\mathbf{x}_i^\top \mathbf{v})^2 \geq 0$$

and the inequality is strict whenever $X$ has full column rank, because $\hat{y}_i (1 - \hat{y}_i) > 0$ (Sigmoid output is in $(0, 1)$). Therefore $H$ is positive definite, the loss function $\mathcal{L}(\mathbf{w})$ is strictly convex, and there exists a unique global optimum.
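The positive definiteness claim can be checked empirically by inspecting the eigenvalues of $H$ on random data (which has full column rank almost surely); a sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
w = rng.normal(size=4)

y_hat = sigmoid(X @ w)
S = np.diag(y_hat * (1 - y_hat))
H = X.T @ S @ X / len(X)        # H = (1/N) X^T S X

eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals > 0)      # positive definite when X has full column rank
```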

Gradient Descent and Stochastic Optimization

Batch Gradient Descent (BGD):

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot \frac{1}{N} X^\top (\hat{\mathbf{y}} - \mathbf{y})$$

Stochastic Gradient Descent (SGD): Randomly select one sample $i$ each time,

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\, (\hat{y}_i - y_i)\, \mathbf{x}_i$$

Mini-batch Gradient Descent: Select a batch $\mathcal{B}$ of size $B$ each time,

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot \frac{1}{B} \sum_{i \in \mathcal{B}} (\hat{y}_i - y_i)\, \mathbf{x}_i$$

Multiclass Generalization: Softmax Regression

From Binary to Multiclass Classification

For $K$-class classification ($y \in \{1, \ldots, K\}$), we need to learn a weight vector $\mathbf{w}_k$ for each class. Define the score for class $k$:

$$z_k = \mathbf{w}_k^\top \mathbf{x}$$

Use the Softmax function to normalize scores into probabilities:

$$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

where $W = [\mathbf{w}_1, \ldots, \mathbf{w}_K] \in \mathbb{R}^{d \times K}$ is the parameter matrix.

Normalization Verification:

$$\sum_{k=1}^{K} P(y = k \mid \mathbf{x}) = \frac{\sum_{k=1}^{K} e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} = 1$$

Cross-Entropy Loss and One-Hot Encoding

Introduce one-hot encoding: if the true class is $c$, then $\mathbf{y} = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ (only the $c$-th position is 1). The loss function is:

$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

where $\hat{y}_k = P(y = k \mid \mathbf{x})$.

Simplification: Since each sample has only one $y_k = 1$ (the $c$-th class), and the rest are 0,

$$\mathcal{L} = -\log \hat{y}_c$$

This is the Negative Log-Likelihood (NLL) for multiclass classification.
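Both forms can be checked with a few lines of NumPy (the toy logits and class index are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5, 3.0])
p = softmax(z)
assert np.isclose(p.sum(), 1.0)            # normalization check

c = 3                                      # true class index
y_onehot = np.zeros(4)
y_onehot[c] = 1.0

full_form = -np.sum(y_onehot * np.log(p))  # -sum_k y_k log p_k
short_form = -np.log(p[c])                 # -log p_c
assert np.isclose(full_form, short_form)
```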

Softmax Gradient Derivation

For the single-sample loss $\mathcal{L} = -\log \hat{y}_c = -z_c + \log \sum_{j=1}^{K} e^{z_j}$, compute the gradient with respect to $z_j$:

First term:

$$\frac{\partial (-z_c)}{\partial z_j} = -\mathbb{1}[j = c]$$

Second term:

$$\frac{\partial}{\partial z_j} \log \sum_{k=1}^{K} e^{z_k} = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} = \hat{y}_j$$

Combining:

$$\frac{\partial \mathcal{L}}{\partial z_j} = \hat{y}_j - \mathbb{1}[j = c] = \hat{y}_j - y_j$$

Further taking the derivative with respect to $\mathbf{w}_j$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_j} = (\hat{y}_j - y_j)\, \mathbf{x}$$

The total gradient in matrix form is:

$$\nabla_W \mathcal{L} = \frac{1}{N} X^\top (\hat{Y} - Y)$$

where $Y, \hat{Y} \in \mathbb{R}^{N \times K}$ are the true one-hot matrix and the predicted probability matrix, respectively.
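The result that the gradient with respect to the logits is simply "predicted probability minus one-hot label" can again be validated with finite differences (toy values, purely illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])
c = 0                                   # true class
y = np.zeros(3)
y[c] = 1.0

analytic = softmax(z) - y               # dL/dz_j = p_j - y_j

# Central finite differences on L(z) = -log softmax(z)[c]
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (-np.log(softmax(zp)[c]) + np.log(softmax(zm)[c])) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-7)
```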

Regularization Techniques

L2 Regularization (Ridge Logistic Regression)

Adding an L2 penalty term:

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$$

The gradient becomes:

$$\nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}} = \nabla_{\mathbf{w}} \mathcal{L} + \lambda \mathbf{w}$$

Update formula:

$$\mathbf{w} \leftarrow (1 - \eta \lambda)\, \mathbf{w} - \eta \nabla_{\mathbf{w}} \mathcal{L}$$

The $(1 - \eta \lambda)$ term produces a weight decay effect.

L1 Regularization (Lasso Logistic Regression)

Adding an L1 penalty:

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\mathbf{w}\|_1$$

The L1 norm is non-differentiable at 0, so we use the subgradient:

$$\nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}} = \nabla_{\mathbf{w}} \mathcal{L} + \lambda\, \mathrm{sign}(\mathbf{w})$$

where $\mathrm{sign}(w_j) = 1$ (when $w_j > 0$), $-1$ (when $w_j < 0$), or any value in $[-1, 1]$ (when $w_j = 0$).

Sparsity: L1 regularization tends to produce sparse solutions (many weights are exactly 0), achieving feature selection.
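One caveat: plain subgradient steps rarely land weights exactly at zero in finite time. Proximal gradient methods (ISTA), a common alternative not covered in this post, apply a soft-thresholding operator after each gradient step and do produce exact zeros. A sketch of that operator:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink toward zero, clipping small entries to exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.02, -1.3])
w_new = soft_threshold(w, 0.1)

# Entries with |w_j| <= t become exactly 0; the rest shrink by t toward zero
assert w_new[1] == 0.0 and w_new[2] == 0.0
assert np.isclose(w_new[0], 0.7) and np.isclose(w_new[3], -1.2)
```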

Elastic Net

Combining L1 and L2:

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda_1 \|\mathbf{w}\|_1 + \frac{\lambda_2}{2} \|\mathbf{w}\|_2^2$$

This combines the sparsity of L1 with the stability of L2.

Decision Boundaries and Geometric Interpretation

Binary Classification Decision Boundary

The decision rule for logistic regression:

$$\hat{y} = \begin{cases} 1, & \sigma(\mathbf{w}^\top \mathbf{x}) \geq 0.5 \\ 0, & \text{otherwise} \end{cases}$$

Since $\sigma(0) = 0.5$, the decision boundary is:

$$\mathbf{w}^\top \mathbf{x} = 0$$

This is a hyperplane in the feature space.

Distance from a Point to the Boundary: For sample $\mathbf{x}$, its signed distance to the decision boundary is:

$$d = \frac{\mathbf{w}^\top \mathbf{x}}{\|\mathbf{w}\|}$$

The greater the distance, the more confident the classification.
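A small numerical example (the weights and point are chosen by hand for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([3.0, -4.0])            # ||w|| = 5
x = np.array([2.0, 1.0])

z = w @ x                            # 3*2 - 4*1 = 2
distance = z / np.linalg.norm(w)     # signed distance = 2/5 = 0.4
prob = sigmoid(z)

assert np.isclose(distance, 0.4)
assert prob > 0.5                    # the point lies on the positive side
```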

Multiclass Decision Regions

In $K$-class classification, the boundary between class $k$ and class $j$ is:

$$\mathbf{w}_k^\top \mathbf{x} = \mathbf{w}_j^\top \mathbf{x} \quad \Longleftrightarrow \quad (\mathbf{w}_k - \mathbf{w}_j)^\top \mathbf{x} = 0$$

The feature space is divided into $K$ regions, each corresponding to a class. Boundaries between adjacent regions are linear.

Model Evaluation and Diagnostics

Confusion Matrix and Performance Metrics

For binary classification, define:

  • TP (True Positive): Actually positive, predicted positive
  • FP (False Positive): Actually negative, predicted positive
  • TN (True Negative): Actually negative, predicted negative
  • FN (False Negative): Actually positive, predicted negative

Accuracy: $\dfrac{TP + TN}{TP + FP + TN + FN}$

Precision: $\dfrac{TP}{TP + FP}$

Recall: $\dfrac{TP}{TP + FN}$

F1 Score: $\dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
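All four metrics take a few lines of NumPy (the toy labels below are for illustration only):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # 3
fp = np.sum((y_true == 0) & (y_pred == 1))   # 1
tn = np.sum((y_true == 0) & (y_pred == 0))   # 3
fn = np.sum((y_true == 1) & (y_pred == 0))   # 1

accuracy = (tp + tn) / (tp + fp + tn + fn)            # 6/8 = 0.75
precision = tp / (tp + fp)                            # 3/4
recall = tp / (tp + fn)                               # 3/4
f1 = 2 * precision * recall / (precision + recall)    # 0.75

assert accuracy == 0.75 and f1 == 0.75
```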

ROC Curve and AUC

By varying the decision threshold $\tau$ (classifying as positive if $\hat{y} \geq \tau$), we obtain different (FPR, TPR) combinations:

  • True Positive Rate (TPR): $\dfrac{TP}{TP + FN}$
  • False Positive Rate (FPR): $\dfrac{FP}{FP + TN}$

ROC Curve: The curve with FPR as the x-axis and TPR as the y-axis.

AUC (Area Under Curve): The area under the ROC curve, measuring ranking ability. AUC=1 indicates a perfect classifier, AUC=0.5 indicates random guessing.

Probabilistic Interpretation: AUC equals the probability that a randomly selected positive sample has a higher score than a randomly selected negative sample.
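This pairwise interpretation can be verified directly against scikit-learn's `roc_auc_score` (the random scores and seed are arbitrary):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores = rng.random(200)
labels = (rng.random(200) < 0.4).astype(int)

pos = scores[labels == 1]
neg = scores[labels == 0]

# AUC as the probability that a random positive outscores a random negative
# (ties count as 1/2)
pairwise = (np.mean(pos[:, None] > neg[None, :])
            + 0.5 * np.mean(pos[:, None] == neg[None, :]))

assert np.isclose(roc_auc_score(labels, scores), pairwise)
```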

Implementation Details and Numerical Stability

Numerical Overflow in Sigmoid Function

When $z$ is a large negative number, $\sigma(z)$ approaches 0, but naively computing $e^{-z}$ may overflow. Stable implementation:

def stable_sigmoid(z):
    if z >= 0:
        return 1 / (1 + np.exp(-z))
    else:
        exp_z = np.exp(z)
        return exp_z / (1 + exp_z)

Numerical Stability of Softmax

Directly computing $e^{z_k}$ may overflow. Using Softmax's translation invariance:

$$\mathrm{softmax}(\mathbf{z}) = \mathrm{softmax}(\mathbf{z} - c\,\mathbf{1})$$

Taking $c = \max_k z_k$:

def stable_softmax(z):
    z_max = np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z - z_max)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

Complete Training Code

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000,
                 regularization='l2', lambda_reg=0.01):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.reg = regularization
        self.lambda_reg = lambda_reg
        self.w = None

    def sigmoid(self, z):
        # Numerically stable, vectorized sigmoid; np.where evaluates both
        # branches, so the discarded branch may emit a harmless overflow warning
        return np.where(z >= 0,
                        1 / (1 + np.exp(-z)),
                        np.exp(z) / (1 + np.exp(z)))

    def fit(self, X, y):
        N, d = X.shape
        self.w = np.zeros(d)

        for _ in range(self.n_iter):
            # Forward propagation
            z = X @ self.w
            y_hat = self.sigmoid(z)

            # Compute gradient
            grad = X.T @ (y_hat - y) / N

            # Add regularization
            if self.reg == 'l2':
                grad += self.lambda_reg * self.w
            elif self.reg == 'l1':
                grad += self.lambda_reg * np.sign(self.w)

            # Gradient descent
            self.w -= self.lr * grad

    def predict_proba(self, X):
        return self.sigmoid(X @ self.w)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

# Example usage
if __name__ == '__main__':
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, roc_auc_score

    # Generate data
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=15, n_redundant=5,
                               random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Train model
    model = LogisticRegression(learning_rate=0.1, n_iterations=1000,
                               regularization='l2', lambda_reg=0.01)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)

    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"AUC: {roc_auc_score(y_test, y_prob):.4f}")

Multiclass Implementation

class SoftmaxRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000,
                 lambda_reg=0.01):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.lambda_reg = lambda_reg
        self.W = None

    def softmax(self, Z):
        """Stable Softmax computation"""
        Z_max = np.max(Z, axis=1, keepdims=True)
        exp_Z = np.exp(Z - Z_max)
        return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

    def fit(self, X, y):
        N, d = X.shape
        K = len(np.unique(y))
        self.W = np.zeros((d, K))

        # Convert to one-hot encoding
        Y_one_hot = np.zeros((N, K))
        Y_one_hot[np.arange(N), y] = 1

        for _ in range(self.n_iter):
            # Forward propagation
            Z = X @ self.W
            Y_hat = self.softmax(Z)

            # Compute gradient
            grad = X.T @ (Y_hat - Y_one_hot) / N
            grad += self.lambda_reg * self.W  # L2 regularization

            # Update weights
            self.W -= self.lr * grad

    def predict_proba(self, X):
        Z = X @ self.W
        return self.softmax(Z)

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

# Example usage
if __name__ == '__main__':
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Generate multiclass data
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=15, n_redundant=5,
                               n_classes=5, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Train model
    model = SoftmaxRegression(learning_rate=0.1, n_iterations=1000,
                              lambda_reg=0.01)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

Connections with Other Classifiers

Relationship with Perceptron

Perceptron update rule:

$$\mathbf{w} \leftarrow \mathbf{w} + \eta\, (y_i - \hat{y}_i)\, \mathbf{x}_i$$

where $\hat{y}_i = \mathbb{1}[\mathbf{w}^\top \mathbf{x}_i \geq 0]$ is a hard classification. Logistic regression softens it to probabilistic prediction, making training more stable.

Relationship with Linear Discriminant Analysis (LDA)

LDA assumes each class follows a Gaussian distribution with the same covariance matrix. Under this assumption, the posterior probability is:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$

The form is the same as logistic regression. The difference:

  • LDA: Generative model, estimates $P(\mathbf{x} \mid y)$ and $P(y)$
  • Logistic Regression: Discriminative model, directly estimates $P(y \mid \mathbf{x})$

Relationship with Neural Networks

Logistic regression can be viewed as a single-layer neural network:

$$\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$

Softmax regression is a single-layer multiclass neural network, the foundation of deep learning.

Advanced Topics

Class Imbalance Problem

When the positive-negative sample ratio is severely imbalanced (e.g., the overwhelming majority of samples are negative), the model tends to predict the negative class. Solutions:

1. Adjust Decision Threshold: Adjust $\tau$ based on business needs (e.g., lower the threshold to increase recall)

2. Resampling:
  • Oversample the minority class (SMOTE, etc.)
  • Undersample the majority class

3. Loss Weighting:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \alpha_{y_i} \bigl[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \bigr]$$

where the class weight $\alpha_{y_i}$ is set based on class frequency (e.g., inversely proportional to the class's sample count).
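A sketch of option 3, class-weighted binary cross-entropy (the weighting function and the weight values are illustrative, not from this post's implementation):

```python
import numpy as np

def weighted_bce(y, y_hat, w_pos, w_neg):
    """Class-weighted binary cross-entropy (weights are illustrative)."""
    return -np.mean(w_pos * y * np.log(y_hat)
                    + w_neg * (1 - y) * np.log(1 - y_hat))

# Imbalanced toy batch: 1 positive, 9 negatives
y = np.array([1.0] + [0.0] * 9)
y_hat = np.full(10, 0.3)

plain = weighted_bce(y, y_hat, 1.0, 1.0)
weighted = weighted_bce(y, y_hat, 9.0, 1.0)   # upweight the rare positive

assert weighted > plain    # misclassifying the positive now costs more
```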

Online Learning and Streaming Data

For streaming data, use stochastic gradient descent to update sample by sample:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\, (\hat{y}_t - y_t)\, \mathbf{x}_t$$

Suitable for large-scale or real-time scenarios.

Multi-Label Classification

Each sample can belong to multiple classes (e.g., text tagging). Train a binary classifier independently for each label:

$$P(y_k = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}_k^\top \mathbf{x}), \quad k = 1, \ldots, K$$

The loss is the sum of cross-entropies over the labels:

$$\mathcal{L} = -\sum_{k=1}^{K} \bigl[ y_k \log \hat{y}_k + (1 - y_k) \log(1 - \hat{y}_k) \bigr]$$

Q&A Highlights

Q1: Why is it called "logistic regression" instead of "logistic classification"?

A: Historical reasons. Logistic regression was originally used for probabilistic modeling of regression problems, mapping linear model outputs to probabilities through the Logistic function (i.e., Sigmoid). Later it was found to be more suitable for classification tasks, but the name was retained.


Q2: What is the essential difference between logistic regression and linear regression?

A: The core difference lies in the output space and loss function:

  • Linear regression: $\hat{y} \in \mathbb{R}$, uses MSE loss, assumes noise follows a Gaussian distribution
  • Logistic regression: $\hat{y} \in (0, 1)$, uses cross-entropy loss, assumes labels follow a Bernoulli distribution

Both are special cases of Generalized Linear Models (GLM), differing only in link functions and assumed distributions.


Q3: Why is cross-entropy more suitable for classification than MSE?

A: The gradient of MSE contains a $\hat{y}(1 - \hat{y})$ term. When the prediction is very confident ($\hat{y} \to 0$ or $\hat{y} \to 1$) but incorrect, the gradient approaches 0 and there is almost no update. The gradient of cross-entropy is $(\hat{y} - y)\, \mathbf{x}$, producing large gradients even when confidently wrong, quickly correcting errors.


Q4: Can logistic regression fit nonlinear boundaries?

A: Original logistic regression is a linear classifier with hyperplane decision boundaries. But nonlinear classification can be achieved through:

  • Feature engineering: Add polynomial features (e.g., $x_1^2$, $x_1 x_2$)
  • Kernel methods: Implicitly map to a high-dimensional space
  • Neural networks: Stack multiple logistic-regression-like layers


Q5: What is the difference between Softmax and multiple independent Sigmoids?

A:

  • Softmax: Class probabilities are normalized, $\sum_{k=1}^{K} \hat{y}_k = 1$, used for mutually exclusive multiclass (single-label) problems
  • Multiple Sigmoids: Labels are independent, the probability sum need not equal 1, used for multi-label classification

For example, news classification (single category) uses Softmax, tag recommendation (multiple tags) uses multiple Sigmoids.


Q6: How to choose the regularization parameter?

A: Through cross-validation grid search:

  1. Candidate values: $\lambda \in \{10^{-4}, 10^{-3}, \ldots, 10^{1}\}$
  2. Select the $\lambda$ with minimum validation error

Generally, larger $\lambda$ means simpler models (underfitting risk), and smaller $\lambda$ means more complex models (overfitting risk).


Q7: Why is logistic regression a convex optimization problem?

A: The Hessian matrix $H = \frac{1}{N} X^\top S X$ is positive definite (the diagonal elements of $S$ are $\hat{y}_i (1 - \hat{y}_i) > 0$), so the loss function is strictly convex. Convexity guarantees:

  • Any local optimum is the global optimum
  • Gradient descent and related algorithms converge to the global optimum (given a suitable learning rate)

This is an important advantage of logistic regression.


Q8: Can logistic regression handle missing values?

A: Standard logistic regression does not directly support missing values. Common handling methods:

  • Deletion: Remove samples with missing values (loses information)
  • Imputation: Fill with mean/median/mode
  • Indicator variables: Add binary indicators for missing features
  • Model prediction: Predict missing values using other features

Or use algorithms that support missing values (e.g., XGBoost).


Q9: Why is feature standardization needed?

A: Different features have different scales (e.g., age in [0, 100] vs income in [0, 1e6]), causing:

  1. Large differences in gradient magnitudes, requiring extremely small learning rates
  2. Some features dominating weight updates
  3. Unfair regularization (the penalty depends on feature scale)

Standardization ($x' = \frac{x - \mu}{\sigma}$) makes features comparable, so optimization is more stable.


Q10: What is the difference between logistic regression and SVM?

A:

| Dimension | Logistic Regression | SVM |
|-----------|---------------------|-----|
| Loss | Cross-entropy | Hinge Loss |
| Output | Probability | Decision value |
| Support vectors | All samples participate | Only boundary samples |
| Kernel trick | Not directly supported | Naturally supported |
| Convexity | Strictly convex | Convex |

Logistic regression is suitable for probabilistic prediction, SVM is suitable for hard classification and nonlinear boundaries.


Q11: How to interpret logistic regression coefficients?

A: The weight $w_j$ represents the effect of feature $x_j$ on the log-odds $\log \frac{P(y=1 \mid \mathbf{x})}{P(y=0 \mid \mathbf{x})} = \mathbf{w}^\top \mathbf{x}$:

  • $w_j > 0$: As $x_j$ increases, the positive class probability increases
  • $w_j < 0$: As $x_j$ increases, the negative class probability increases
  • Large $|w_j|$: Strong influence

Odds ratio: Increasing $x_j$ by 1 unit multiplies the odds by $e^{w_j}$.
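A quick numerical confirmation of the odds-ratio interpretation (the weights, bias, and sample point are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.7, -1.2]), 0.3
x = np.array([1.0, 2.0])

def odds(x):
    p = sigmoid(w @ x + b)
    return p / (1 - p)

x_plus = x + np.array([1.0, 0.0])        # increase feature 0 by one unit
ratio = odds(x_plus) / odds(x)

assert np.isclose(ratio, np.exp(w[0]))   # odds multiply by e^{w_0}
```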


Q12: What is the time complexity of logistic regression?

A:

  • Single iteration: $O(Nd)$ (matrix-vector multiplication)
  • Total training: $O(TNd)$, where $T$ is the number of iterations
  • Prediction: $O(Md)$ for $M$ samples

For large-scale data, using stochastic gradient descent (SGD) or mini-batch gradient descent, a single iteration reduces to $O(Bd)$ (where $B$ is the batch size).


Experiments and Case Studies

Case 1: Spam Email Classification

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Simulated data
emails = [
    "Win a free iPhone now!",
    "Meeting at 3pm tomorrow",
    "You have won $1,000,000!",
    "Project deadline reminder",
    # ... more emails
]
labels = [1, 0, 1, 0, ...]  # 1=spam, 0=normal

# Text feature extraction
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(emails).toarray()

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# Train model (the LogisticRegression class defined above)
model = LogisticRegression(learning_rate=0.1, n_iterations=1000,
                           regularization='l2', lambda_reg=0.1)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Feature importance analysis
feature_names = vectorizer.get_feature_names_out()
top_indices = np.argsort(np.abs(model.w))[-10:]  # Top 10 important features
print("Most relevant words:", feature_names[top_indices])

Case 2: Medical Diagnosis

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Standardize
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train models with different regularization strengths
lambdas = [0, 0.01, 0.1, 1.0]
for lam in lambdas:
    model = LogisticRegression(learning_rate=0.1, n_iterations=1000,
                               regularization='l2', lambda_reg=lam)
    model.fit(X_train, y_train)

    y_prob = model.predict_proba(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f'λ={lam} (AUC={roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves with Different Regularizations')
plt.legend()
plt.show()


Logistic regression, with its concise mathematical form, clear probabilistic interpretation, and efficient optimization algorithms, has become a baseline model for classification tasks. From Sigmoid to Softmax, from gradient descent to regularization, this chapter has completely derived the theoretical framework of logistic regression. Understanding logistic regression is not only the foundation for mastering classical machine learning, but also a necessary path to delving into neural networks and deep learning — after all, every layer of a deep neural network contains the shadow of logistic regression.

  • Post title: Mathematical Derivations in Machine Learning (6): Logistic Regression and Classification
  • Post author: Chen Kai
  • Create time: 2025-02-15 00:00:00
  • Post link: https://www.chenk.top/mathematical-derivations-in-machine-learning-06-logistic-regression-classification/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.