The leap from linear regression to logistic regression marks a crucial transition in machine learning: from regression tasks to classification tasks. Despite the word "regression" in its name, logistic regression is a foundational classification algorithm that bridges linear models and probabilistic predictions through the Sigmoid function. This chapter derives the mathematics of logistic regression in depth: from the construction of the likelihood function to the details of gradient computation, from binary classification to multiclass generalization, and from optimization algorithms to regularization techniques, revealing the probabilistic modeling philosophy behind classification problems.
From Linear Models to Probabilistic Classification
Limitations of Linear Classification
Recall that in linear regression, we established a linear mapping between inputs and continuous outputs:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

Applying this model directly to classification runs into two problems:
- Unconstrained output: $\mathbf{w}^\top \mathbf{x} + b$ can be any real number, but class labels need to be in a finite set such as $\{0, 1\}$
- Missing probabilistic interpretation: linear models cannot provide "the probability that a sample belongs to a class"
Logistic regression solves this contradiction by introducing a link function: mapping the output $z = \mathbf{w}^\top \mathbf{x} + b$ of the linear model to the probability interval $(0, 1)$.
Sigmoid Function: From Real Numbers to Probabilities
The Sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Property 1: Range Constraint — for all $z \in \mathbb{R}$, $0 < \sigma(z) < 1$, so the output can be interpreted as a probability.

Property 2: Symmetry — $\sigma(-z) = 1 - \sigma(z)$.

Proof: $\sigma(-z) = \dfrac{1}{1 + e^{z}} = \dfrac{e^{-z}}{e^{-z} + 1} = 1 - \dfrac{1}{1 + e^{-z}} = 1 - \sigma(z)$.

Property 3: Derivative Self-Representation — $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$.

Proof: $\sigma'(z) = \dfrac{e^{-z}}{(1 + e^{-z})^2} = \dfrac{1}{1 + e^{-z}} \cdot \dfrac{e^{-z}}{1 + e^{-z}} = \sigma(z)\big(1 - \sigma(z)\big)$.
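All three properties can be verified numerically; the following is a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10.0, 10.0, 201)
s = sigmoid(z)

# Property 1: outputs stay strictly inside (0, 1)
in_range = bool(np.all((s > 0) & (s < 1)))

# Property 2: sigma(-z) = 1 - sigma(z)
symmetric = bool(np.allclose(sigmoid(-z), 1.0 - s))

# Property 3: sigma'(z) = sigma(z)(1 - sigma(z)), checked by central differences
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
derivative_ok = bool(np.allclose(numeric, s * (1.0 - s), atol=1e-7))
```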
Definition of Logistic Regression Model
For a binary classification task ($y \in \{0, 1\}$), logistic regression models the posterior probability as

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad P(y = 0 \mid \mathbf{x}) = 1 - \sigma(\mathbf{w}^\top \mathbf{x} + b).$$

Unified Representation: Using exponential form, the two probabilities can be combined as

$$P(y \mid \mathbf{x}) = \sigma(z)^{y}\,\big(1 - \sigma(z)\big)^{1 - y}, \qquad z = \mathbf{w}^\top \mathbf{x} + b.$$
Maximum Likelihood Estimation and Loss Function
Construction of Likelihood Function
Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ of independent samples, the likelihood of the parameters is

$$L(\mathbf{w}) = \prod_{i=1}^{n} p_i^{y_i}\,(1 - p_i)^{1 - y_i}, \qquad p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b).$$
Log-Likelihood and Cross-Entropy
Taking the logarithm gives the log-likelihood:

$$\ell(\mathbf{w}) = \sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \big].$$

Maximizing $\ell(\mathbf{w})$ is equivalent to minimizing the cross-entropy loss $J(\mathbf{w}) = -\frac{1}{n}\ell(\mathbf{w})$.

Information-Theoretic Interpretation: Cross-entropy measures the difference between the true distribution $(y_i, 1 - y_i)$ and the predicted distribution $(p_i, 1 - p_i)$; it is minimized exactly when the two coincide.
Comparison with Mean Squared Error
If using Mean Squared Error (MSE) as the loss:

$$J_{\text{MSE}}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)^2,$$

the objective is non-convex in $\mathbf{w}$, and its gradient contains a $\sigma'(z_i)$ factor that vanishes when the model is saturated, even when it is confidently wrong. The gradient of cross-entropy loss is (derived in the next section):

$$\nabla_{\mathbf{w}} J = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,\mathbf{x}_i,$$

which contains no $\sigma'$ factor.
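Before trusting this gradient in an optimizer, it can be checked against finite differences; the sketch below uses synthetic data and omits the bias term for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)

def loss(w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient of the mean cross-entropy: X^T (p - y) / n
analytic = X.T @ (sigmoid(X @ w) - y) / len(y)

# Central-difference approximation, one coordinate at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = h
    numeric[j] = (loss(w + e) - loss(w - e)) / (2 * h)
```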
Gradient Derivation and Optimization Algorithms
Precise Gradient Computation
For the loss of a single sample, $J_i = -\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$ with $p_i = \sigma(z_i)$ and $z_i = \mathbf{w}^\top \mathbf{x}_i$, apply the chain rule:

Step 1: $\dfrac{\partial J_i}{\partial p_i} = -\dfrac{y_i}{p_i} + \dfrac{1 - y_i}{1 - p_i} = \dfrac{p_i - y_i}{p_i(1 - p_i)}$.

Step 2: Using the Sigmoid derivative property, $\dfrac{\partial p_i}{\partial z_i} = p_i(1 - p_i)$.

Step 3: $\dfrac{\partial z_i}{\partial \mathbf{w}} = \mathbf{x}_i$.

Combining: $\nabla_{\mathbf{w}} J_i = \dfrac{p_i - y_i}{p_i(1 - p_i)} \cdot p_i(1 - p_i) \cdot \mathbf{x}_i = (p_i - y_i)\,\mathbf{x}_i$.
Hessian Matrix and Second-Order Methods
For Newton's method and other second-order optimization, we need to compute the Hessian matrix:

$$\mathbf{H} = \nabla_{\mathbf{w}}^2 J = \frac{1}{n}\,\mathbf{X}^\top \mathbf{S}\,\mathbf{X}, \qquad \mathbf{S} = \operatorname{diag}\big(p_1(1 - p_1), \dots, p_n(1 - p_n)\big).$$

Positive Definiteness Analysis: For any $\mathbf{v} \neq \mathbf{0}$,

$$\mathbf{v}^\top \mathbf{H}\,\mathbf{v} = \frac{1}{n}\sum_{i=1}^{n} p_i(1 - p_i)\,(\mathbf{v}^\top \mathbf{x}_i)^2 \ge 0,$$

so $\mathbf{H}$ is positive semidefinite and the cross-entropy loss is convex. Newton's update is $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1}\nabla_{\mathbf{w}} J$.
Gradient Descent and Stochastic Optimization
Batch Gradient Descent (BGD): use the full dataset at each step,

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,\mathbf{x}_i.$$

Stochastic Gradient Descent (SGD): Randomly select one sample $i$ and update

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,(p_i - y_i)\,\mathbf{x}_i.$$

Mini-batch Gradient Descent: Select a batch $\mathcal{B}$ of size $B$ (typically 32 to 256) and average the gradient over it, trading off the stability of BGD against the speed of SGD.
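The mini-batch variant can be sketched on synthetic data as follows (the batch size, learning rate, and ground-truth weights are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, B = 1000, 2, 32                      # samples, features, batch size
w_true = np.array([1.5, -2.0])             # ground truth for the simulation
X = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w, eta = np.zeros(d), 0.3
for epoch in range(200):
    perm = rng.permutation(n)              # reshuffle each epoch
    for start in range(0, n, B):
        idx = perm[start:start + B]
        # average gradient over the mini-batch
        grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
        w -= eta * grad
```

After a few hundred epochs the estimate lands close to the generating weights.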
Multiclass Generalization: Softmax Regression
From Binary to Multiclass Classification
For $K$ classes, Softmax regression assigns each class $k$ its own weight vector $\mathbf{w}_k$ and models

$$P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}}.$$

Normalization Verification: each probability is strictly positive, and $\sum_{k=1}^{K} P(y = k \mid \mathbf{x}) = 1$ by construction.
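A quick numeric check that the Softmax outputs form a valid probability distribution over each row of logits:

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))   # shift for stability
    return E / E.sum(axis=-1, keepdims=True)

P = softmax(np.array([[2.0, 1.0, 0.1],
                      [-1.0, 0.0, 3.0]]))
row_sums = P.sum(axis=1)
```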
Cross-Entropy Loss and One-Hot Encoding
Introduce one-hot encoding: if the true class is $c$, let $\mathbf{y} = (y_1, \dots, y_K)$ with $y_c = 1$ and all other entries 0. The multiclass cross-entropy is

$$J = -\sum_{k=1}^{K} y_k \log p_k, \qquad p_k = P(y = k \mid \mathbf{x}).$$

Simplification: Since each sample has only one nonzero $y_k$, the sum collapses to $J = -\log p_c$, the negative log-probability assigned to the true class.
Softmax Gradient Derivation
For the single-sample loss $J = -\sum_k y_k \log p_k$, compute $\partial p_k / \partial z_j$ where $z_j = \mathbf{w}_j^\top \mathbf{x}$:

First term: when $j = k$, $\dfrac{\partial p_k}{\partial z_k} = p_k(1 - p_k)$.

Second term: when $j \neq k$, $\dfrac{\partial p_k}{\partial z_j} = -p_k\,p_j$.

Combining: $\dfrac{\partial J}{\partial z_j} = p_j - y_j$, so $\nabla_{\mathbf{w}_j} J = (p_j - y_j)\,\mathbf{x}$, the same "prediction minus label" form as the binary case.
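The result $\partial J / \partial z_j = p_j - y_j$ can be confirmed with finite differences on a single sample (logits chosen arbitrarily):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0, 0.3])      # logits for one sample
y = np.array([0.0, 0.0, 1.0, 0.0])       # one-hot label, true class = 2

def loss(v):
    return -np.log(softmax(v)[2])        # cross-entropy for true class 2

analytic = softmax(z) - y                # claimed gradient dJ/dz = p - y

h = 1e-6
numeric = np.array([
    (loss(z + h * np.eye(4)[k]) - loss(z - h * np.eye(4)[k])) / (2 * h)
    for k in range(4)
])
```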
Regularization Techniques
L2 Regularization (Ridge Logistic Regression)
Adding an L2 penalty term:

$$J_{\text{L2}}(\mathbf{w}) = J(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2.$$

L2 shrinks all weights toward zero, stabilizing the solution when features are correlated.
L1 Regularization (Lasso Logistic Regression)
Adding an L1 penalty:

$$J_{\text{L1}}(\mathbf{w}) = J(\mathbf{w}) + \lambda\|\mathbf{w}\|_1.$$
Sparsity: L1 regularization tends to produce sparse solutions (many weights are exactly 0), achieving feature selection.
Elastic Net
Combining L1 and L2:

$$J_{\text{EN}}(\mathbf{w}) = J(\mathbf{w}) + \lambda\Big(\alpha\|\mathbf{w}\|_1 + \frac{1 - \alpha}{2}\|\mathbf{w}\|_2^2\Big), \qquad \alpha \in [0, 1].$$
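scikit-learn exposes all three penalties in `LogisticRegression`; the elastic net requires the `saga` solver, with `l1_ratio` interpolating between pure L2 (0.0) and pure L1 (1.0). A brief sketch on synthetic data (hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

# Elastic-net penalty: needs solver="saga"; l1_ratio=0.5 mixes L1 and L2 equally
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(X, y)
train_acc = clf.score(X, y)
```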
Decision Boundaries and Geometric Interpretation
Binary Classification Decision Boundary
The decision rule for logistic regression:

$$\hat{y} = \begin{cases} 1, & \sigma(\mathbf{w}^\top \mathbf{x} + b) \ge 0.5 \\ 0, & \text{otherwise} \end{cases}$$

Since $\sigma(z) \ge 0.5 \iff z \ge 0$, the decision boundary is the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$.

Distance from a Point to the Boundary: For a sample $\mathbf{x}$, the signed distance to the hyperplane is $\dfrac{\mathbf{w}^\top \mathbf{x} + b}{\|\mathbf{w}\|}$, so the magnitude of $|\mathbf{w}^\top \mathbf{x} + b|$ reflects prediction confidence.
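As a tiny numeric check of the distance formula (the weights and sample here are made up for illustration):

```python
import numpy as np

# Hypothetical learned parameters and a sample point
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])

# Signed distance from x to the hyperplane w^T x + b = 0
# Here: (3*2 + 4*1 - 5) / 5 = 1.0
dist = (w @ x + b) / np.linalg.norm(w)
```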
Multiclass Decision Regions
In the $K$-class case, a sample $\mathbf{x}$ is assigned to the class with the largest score: $\hat{y} = \arg\max_k \mathbf{w}_k^\top \mathbf{x}$. The boundary between classes $j$ and $k$ is the hyperplane $(\mathbf{w}_j - \mathbf{w}_k)^\top \mathbf{x} = 0$, and each decision region is a convex intersection of half-spaces.
Model Evaluation and Diagnostics
Confusion Matrix and Performance Metrics
For binary classification, define:
- TP (True Positive): Actually positive, predicted positive
- FP (False Positive): Actually negative, predicted positive
- TN (True Negative): Actually negative, predicted negative
- FN (False Negative): Actually positive, predicted negative
Accuracy: $\dfrac{TP + TN}{TP + FP + TN + FN}$ — the fraction of all samples classified correctly.

Precision: $\dfrac{TP}{TP + FP}$ — among samples predicted positive, the fraction that are actually positive.

Recall: $\dfrac{TP}{TP + FN}$ — among actually positive samples, the fraction that are found.

F1 Score: $\dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ — the harmonic mean of precision and recall.
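All four metrics follow directly from the confusion-matrix counts; a small self-contained sketch:

```python
def binary_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# TP=2, FP=1, TN=2, FN=1 for this toy example
m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```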
ROC Curve and AUC
By varying the decision threshold $t$ from 1 down to 0 and recording at each value:
- True Positive Rate (TPR): $\dfrac{TP}{TP + FN}$
- False Positive Rate (FPR): $\dfrac{FP}{FP + TN}$
ROC Curve: Curve with FPR as the x-axis and TPR as the y-axis.
AUC (Area Under Curve): The area under the ROC curve, measuring ranking ability. AUC=1 indicates a perfect classifier, AUC=0.5 indicates random guessing.
Probabilistic Interpretation: AUC equals the probability that a randomly selected positive sample has a higher score than a randomly selected negative sample.
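This equivalence can be checked numerically: the rank-based (Mann-Whitney) computation of AUC matches the fraction of positive-negative pairs ranked correctly. A sketch with synthetic scores (continuous, so ties are ignored):

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.normal(size=200)            # classifier scores
labels = rng.integers(0, 2, size=200)    # ground-truth classes

pos, neg = scores[labels == 1], scores[labels == 0]

# Fraction of (positive, negative) pairs where the positive scores higher
pairwise = np.mean(pos[:, None] > neg[None, :])

# AUC via the rank-based (Mann-Whitney U) formula
order = np.argsort(scores)
ranks = np.empty(len(scores))
ranks[order] = np.arange(1, len(scores) + 1)
n_pos, n_neg = len(pos), len(neg)
auc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```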
Implementation Details and Numerical Stability
Numerical Overflow in Sigmoid Function
When $z$ is a large negative number, $e^{-z}$ overflows in double precision (for $z \lesssim -710$, `np.exp(-z)` returns `inf`). A numerically stable implementation branches on the sign of $z$ so the exponent is always non-positive:

```python
import numpy as np

def stable_sigmoid(z):
    # For z >= 0, exp(-z) <= 1 and cannot overflow.
    # For z < 0, rewrite sigma(z) = exp(z) / (1 + exp(z)): the exponent is negative.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out
```
Numerical Stability of Softmax
Directly computing $e^{z_k}$ overflows for large logits. Since Softmax is shift-invariant, $\mathrm{softmax}(\mathbf{z}) = \mathrm{softmax}(\mathbf{z} - c)$ for any constant $c$, subtracting the maximum logit first keeps every exponent non-positive:

```python
import numpy as np

def stable_softmax(z):
    # Shift by the row maximum; the output is unchanged,
    # but every exponent is guaranteed to be <= 0.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```
Complete Training Code
The pieces above can be assembled into a compact NumPy trainer; this is a minimal sketch using batch gradient descent, with illustrative hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BinaryLogisticRegression:
    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr            # learning rate
        self.n_iters = n_iters  # number of gradient steps

    def fit(self, X, y):
        n, d = X.shape
        self.w, self.b = np.zeros(d), 0.0
        for _ in range(self.n_iters):
            p = sigmoid(X @ self.w + self.b)
            # Gradient of the mean cross-entropy: X^T (p - y) / n
            self.w -= self.lr * X.T @ (p - y) / n
            self.b -= self.lr * np.mean(p - y)
        return self

    def predict_proba(self, X):
        return sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
```
Multiclass Implementation
A matching multiclass trainer (again a minimal sketch; bias terms are omitted for brevity):

```python
import numpy as np

class SoftmaxRegression:
    def __init__(self, n_classes, lr=0.1, n_iters=1000):
        self.n_classes = n_classes
        self.lr = lr
        self.n_iters = n_iters

    def _softmax(self, Z):
        E = np.exp(Z - Z.max(axis=1, keepdims=True))  # shifted for stability
        return E / E.sum(axis=1, keepdims=True)

    def fit(self, X, y):
        n, d = X.shape
        self.W = np.zeros((d, self.n_classes))
        Y = np.eye(self.n_classes)[y]                 # one-hot labels
        for _ in range(self.n_iters):
            P = self._softmax(X @ self.W)
            self.W -= self.lr * X.T @ (P - Y) / n     # gradient: X^T (P - Y) / n
        return self

    def predict(self, X):
        return np.argmax(X @ self.W, axis=1)
```
Connections with Other Classifiers
Relationship with Perceptron
Perceptron update rule: on a misclassified sample (labels $y_i \in \{-1, +1\}$), $\mathbf{w} \leftarrow \mathbf{w} + \eta\,y_i\,\mathbf{x}_i$; correctly classified samples trigger no update. The SGD update of logistic regression, $\mathbf{w} \leftarrow \mathbf{w} - \eta\,(p_i - y_i)\,\mathbf{x}_i$, can be viewed as a "soft" perceptron: every sample contributes, weighted by the continuous residual $p_i - y_i$ rather than a hard 0/1 error.
Relationship with Linear Discriminant Analysis (LDA)
LDA assumes each class follows a Gaussian distribution with the same covariance matrix. Under this assumption, the posterior probability is

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$

for some $\mathbf{w}, b$ determined by the class means and shared covariance, exactly the logistic regression form. The difference is that LDA is generative (it fits the class-conditional densities), while logistic regression is discriminative (it fits the posterior directly).
Relationship with Neural Networks
Logistic regression can be viewed as a single-layer neural network: a linear layer followed by a Sigmoid activation, trained with cross-entropy loss. The final classification layer of a deep network is exactly a logistic (or Softmax) regression on the learned features.
Advanced Topics
Class Imbalance Problem
When the positive-negative sample ratio is severely imbalanced (e.g., 1:100), a classifier can reach high accuracy by always predicting the majority class. Common remedies:
1. Adjust Decision Threshold: lower the threshold below 0.5 to trade precision for minority-class recall
2. Resampling:
   - Oversample the minority class (SMOTE, etc.)
   - Undersample the majority class
3. Loss Weighting: weight each class's loss term inversely to its frequency, e.g. $J = -\frac{1}{n}\sum_i \big[\alpha\,y_i \log p_i + (1-\alpha)(1-y_i)\log(1-p_i)\big]$ with $\alpha > 0.5$ for a rare positive class
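Loss weighting is available in scikit-learn via the `class_weight` parameter; the sketch below compares minority-class recall with and without it on synthetic imbalanced data (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic data with roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X, y)

# Reweighting the loss typically raises recall on the minority class
r_plain = recall_score(y, plain.predict(X))
r_weighted = recall_score(y, weighted.predict(X))
```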
Online Learning and Streaming Data
For streaming data, use stochastic gradient descent to update on each sample as it arrives:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\,\big(\sigma(\mathbf{w}_t^\top \mathbf{x}_t) - y_t\big)\,\mathbf{x}_t,$$

typically with a decaying step size such as $\eta_t = \eta_0 / \sqrt{t}$. The model is updated one sample at a time and never needs to store the full dataset.
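A sketch of the streaming update with the decaying step size $\eta_t = \eta_0/\sqrt{t}$; the data stream is simulated from a known model purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
w_true = np.array([2.0, -1.0])   # ground truth generating the stream
w = np.zeros(2)
eta0 = 0.5

for t in range(1, 5001):
    # each sample arrives once and is then discarded
    x = rng.normal(size=2)
    y = float(rng.random() < sigmoid(x @ w_true))
    eta = eta0 / np.sqrt(t)                  # decaying step size
    w -= eta * (sigmoid(x @ w) - y) * x
```

After a few thousand stream samples, the running estimate tracks the generating weights.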
Multi-Label Classification
Each sample can belong to multiple classes (e.g., text tagging).
The standard approach (binary relevance) trains a binary logistic classifier independently for each label, each with its own Sigmoid output; unlike Softmax, the label probabilities are not forced to sum to 1.
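In scikit-learn, binary relevance can be expressed with `OneVsRestClassifier` wrapping a logistic regression; the toy data and label rules below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label data: 2 features, 3 labels per sample
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]] * 10)
Y = np.column_stack([X[:, 0] > 0.5,          # label 0: feature 0 active
                     X[:, 1] > 0.5,          # label 1: feature 1 active
                     X.sum(axis=1) > 1.5]    # label 2: both active
                    ).astype(int)

# One independent binary logistic classifier per label (binary relevance)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)   # a 0/1 matrix, one column per label
```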
Q&A Highlights
Q1: Why is it called "logistic regression" instead of "logistic classification"?
A: Historical reasons. Logistic regression was originally used for probabilistic modeling of regression problems, mapping linear model outputs to probabilities through the Logistic function (i.e., Sigmoid). Later it was found to be more suitable for classification tasks, but the name was retained.
Q2: What is the essential difference between logistic regression and linear regression?
A: The core difference lies in the output space and loss function:
- Linear regression: outputs a continuous value $\hat{y} = \mathbf{w}^\top\mathbf{x} + b \in \mathbb{R}$ and minimizes squared error (maximum likelihood under Gaussian noise)
- Logistic regression: outputs a probability $\sigma(\mathbf{w}^\top\mathbf{x} + b) \in (0, 1)$ and minimizes cross-entropy (maximum likelihood under a Bernoulli distribution)
Both are special cases of Generalized Linear Models (GLM), differing only in link functions and assumed distributions.
Q3: Why is cross-entropy more suitable for classification than MSE?
A: The gradient of MSE contains a $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ factor that approaches 0 whenever the prediction saturates, including when it is confidently wrong, so learning stalls. The cross-entropy gradient $(p - y)\,\mathbf{x}$ has no such factor: the larger the error, the larger the update. Moreover, cross-entropy keeps the optimization convex for logistic regression, while MSE does not.
Q4: Can logistic regression fit nonlinear boundaries?
A: Original logistic regression is a linear classifier with hyperplane decision boundaries. But nonlinear classification can be achieved through:
- Feature engineering: add polynomial or interaction features (e.g., $x_1^2$, $x_1 x_2$), so the boundary is linear in the expanded feature space but nonlinear in the original one
- Learned features: feed the inputs through a feature extractor (e.g., the hidden layers of a neural network) before the logistic output layer
Q5: What is the difference between Softmax and multiple independent Sigmoids?
A:
- Softmax: class probabilities are normalized, $\sum_k p_k = 1$, so exactly one class is assumed true; suited to mutually exclusive classes
- Multiple independent Sigmoids: each label receives its own probability in $(0, 1)$ with no sum constraint; suited to labels that can co-occur
For example, news classification (single category) uses Softmax, tag recommendation (multiple tags) uses multiple Sigmoids.
Q6: How to choose the regularization parameter $\lambda$?
A: Through cross-validation grid search:
1. Candidate values: $\{10^{-4}, 10^{-3}, \dots, 10^{1}\}$ on a logarithmic grid
2. Evaluate each candidate with $k$-fold cross-validation and keep the value with the best validation score
Generally, larger $\lambda$ means stronger regularization: prefer larger values when the model overfits (few samples, many features) and smaller values when it underfits.
Q7: Why is logistic regression a convex optimization problem?
A: The Hessian matrix $\mathbf{H} = \frac{1}{n}\mathbf{X}^\top\mathbf{S}\,\mathbf{X}$, with $\mathbf{S} = \operatorname{diag}(p_i(1 - p_i))$, is positive semidefinite, so the cross-entropy loss is convex and gradient descent converges to a global optimum; adding an L2 penalty makes the objective strictly convex and the optimum unique.
This is an important advantage of logistic regression.
Q8: Can logistic regression handle missing values?
A: Standard logistic regression does not directly support missing values. Common handling methods:
- Deletion: Remove samples with missing values (loses information)
- Imputation: Fill with mean/median/mode
- Indicator variables: Add binary indicators for missing features
- Model prediction: Predict missing values using other features
Or use algorithms that support missing values (e.g., XGBoost).
Q9: Why is feature standardization needed?
A: Different features have different scales (e.g., age in $[0, 100]$ vs. income in $[0, 10^6]$), causing:
1. Large differences in gradient magnitudes across dimensions, forcing an extremely small learning rate
2. Some features to dominate the weight updates
3. Unfair regularization (features on different scales are penalized inconsistently)

Standardization ($x' = (x - \mu)/\sigma$) puts all features on a comparable scale and resolves all three issues.
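A minimal z-score standardization sketch on the age/income example (the numbers are made up):

```python
import numpy as np

# Two features on wildly different scales: age and annual income
X = np.array([[25.0, 30000.0],
              [40.0, 90000.0],
              [33.0, 52000.0],
              [58.0, 41000.0]])

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma   # each column now has mean 0 and std 1
```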
Q10: What is the difference between logistic regression and SVM?
A:

| Dimension | Logistic Regression | SVM |
|-----------|---------------------|-----|
| Loss | Cross-entropy | Hinge loss |
| Output | Probability | Decision value |
| Support vectors | All samples participate | Only boundary samples |
| Kernel trick | Not directly supported | Naturally supported |
| Convexity | Strictly convex | Convex |
Logistic regression is suitable for probabilistic prediction, SVM is suitable for hard classification and nonlinear boundaries.
Q11: How to interpret logistic regression coefficients?
A: The weight $w_j$ is the change in the log-odds $\log\frac{P(y=1\mid\mathbf{x})}{P(y=0\mid\mathbf{x})}$ per unit increase in feature $x_j$, holding the other features fixed.
Odds ratio: Increasing $x_j$ by one unit multiplies the odds by $e^{w_j}$; for example, $w_j = 0.7$ roughly doubles the odds ($e^{0.7} \approx 2.01$).
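The odds-ratio interpretation is easy to confirm numerically (the weights, bias, and input below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative parameters and input
w = np.array([0.7, -0.3])
b = -0.2
x = np.array([1.0, 2.0])

def odds(v):
    p = sigmoid(w @ v + b)
    return p / (1 - p)

x_plus = x.copy()
x_plus[0] += 1.0                   # raise feature 0 by one unit
ratio = odds(x_plus) / odds(x)     # equals exp(w[0]) up to float rounding
```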
Q12: What is the time complexity of logistic regression?
A:
- Single iteration of batch gradient descent: $O(nd)$ for $n$ samples and $d$ features
- Newton's method: $O(nd^2 + d^3)$ per iteration (building and inverting the Hessian)

For large-scale data, using stochastic gradient descent (SGD) or mini-batch gradient descent, a single iteration reduces to $O(d)$ or $O(Bd)$ for batch size $B$.
Experiments and Case Studies
Case 1: Spam Email Classification
A minimal sketch of the experiment; the bag-of-words features (counts of "free" and "meeting") and labels are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy features: [count of "free", count of "meeting"]; label 1 = spam
X = np.array([[3., 0.], [2., 0.], [4., 1.], [0., 2.], [0., 3.], [1., 2.]])
y = np.array([1., 1., 1., 0., 0., 0.])

# Batch gradient descent on the cross-entropy loss
w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    p = sigmoid(X @ w + b)
    w -= eta * X.T @ (p - y) / len(y)
    b -= eta * np.mean(p - y)

pred = (sigmoid(X @ w + b) >= 0.5).astype(int)
print("train predictions:", pred, "labels:", y.astype(int))
```
Case 2: Medical Diagnosis
A sketch using scikit-learn on the Wisconsin breast cancer dataset (the split and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Binary task: malignant vs. benign tumors (30 numeric features)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardize features before fitting
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

print(f"Test accuracy: {clf.score(scaler.transform(X_test), y_test):.3f}")
```
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. [Chapter 4: Linear Models for Classification]
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 4: Linear Methods for Classification]
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Chapter 8: Logistic Regression]
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. [Chapter 5: Machine Learning Basics]
- Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14.
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
- Wright, R. E. (1995). Logistic regression. In Reading and Understanding Multivariate Statistics (pp. 217-244). American Psychological Association.
Logistic regression, with its concise mathematical form, clear probabilistic interpretation, and efficient optimization algorithms, has become a baseline model for classification tasks. From Sigmoid to Softmax, from gradient descent to regularization, this chapter has completely derived the theoretical framework of logistic regression. Understanding logistic regression is not only the foundation for mastering classical machine learning, but also a necessary path to delving into neural networks and deep learning — after all, every layer of a deep neural network carries the imprint of logistic regression.
- Post title: Mathematical Derivations in Machine Learning (6): Logistic Regression and Classification
- Post author: Chen Kai
- Create time: 2025-02-15 00:00:00
- Post link: https://www.chenk.top/mathematical-derivations-in-machine-learning-06-logistic-regression-classification/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.