Support Vector Machine (SVM) is one of the most elegant algorithms in modern machine learning — it perfectly combines geometric intuition, convex optimization theory, and kernel methods, achieving efficient classification by finding the maximum margin hyperplane. From linear separability to nonlinear mapping, from hard margin to soft margin, from primal form to dual problem, SVM's mathematical derivations showcase the depth and beauty of machine learning theory. This chapter systematically derives the complete theoretical framework of SVM, including Lagrangian duality, KKT conditions, SMO algorithm, kernel function construction, and theoretical guarantees.
Linear Separable Support Vector Machines
Geometric Intuition: Maximum Margin Classifier
Consider binary classification with training set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$.

The decision function is

$$f(x) = \operatorname{sign}(w^T x + b)$$

Functional Margin: Sample $(x_i, y_i)$ has functional margin $\hat{\gamma}_i = y_i(w^T x_i + b)$ with respect to the hyperplane $(w, b)$.

A positive functional margin indicates correct classification; a larger value indicates higher confidence.

Geometric Margin: Sample $(x_i, y_i)$ has geometric margin

$$\gamma_i = \frac{y_i(w^T x_i + b)}{\|w\|}$$

Derivation: For hyperplane $w^T x + b = 0$, the distance from a point $x_0$ to the hyperplane is $\frac{|w^T x_0 + b|}{\|w\|}$. This is the standard formula from linear algebra. Adding the label $y_i$ removes the absolute value for correctly classified samples and makes the margin signed, so misclassified samples receive a negative margin.
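As a quick numerical check (a minimal sketch with a made-up hyperplane and points, not from the original post), the functional and geometric margins can be computed directly:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 (illustrative values)
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0

# One sample from each class
x_pos, y_pos = np.array([3.0, 4.0]), +1   # w.x + b = 9 + 16 - 5 = 20
x_neg, y_neg = np.array([0.0, 0.0]), -1   # w.x + b = -5

def functional_margin(w, b, x, y):
    return y * (w @ x + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

print(functional_margin(w, b, x_pos, y_pos))  # 20.0
print(geometric_margin(w, b, x_pos, y_pos))   # 20 / 5 = 4.0
print(geometric_margin(w, b, x_neg, y_neg))   # -(-5) / 5 = 1.0
```

Note that scaling $(w, b)$ by a constant changes the functional margin but leaves the geometric margin unchanged, which is why the latter is the right quantity to maximize.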
Maximum Margin Optimization Problem
Hard margin SVM aims to find the separating hyperplane with maximum geometric margin:

$$\max_{w, b} \ \gamma \quad \text{s.t.} \quad \frac{y_i(w^T x_i + b)}{\|w\|} \geq \gamma, \quad i = 1, \dots, n$$

Equivalently, fix the functional margin to $1$ (rescaling $(w, b)$ does not change the hyperplane), so that maximizing $\gamma = 1/\|w\|$ becomes:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \quad i = 1, \dots, n$$
This is a Convex Quadratic Programming (QP) problem.
Support Vectors: Samples satisfying $y_i(w^T x_i + b) = 1$, i.e., those lying exactly on the margin boundary. Only these samples determine the optimal hyperplane.
Dual Problem Derivation
Using Lagrange multipliers, introduce non-negative multipliers $\alpha_i \geq 0$ for each constraint and form the Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$

Primal problem: $\min_{w, b} \max_{\alpha \geq 0} L(w, b, \alpha)$

Dual problem: $\max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)$

Step 1: Minimize over $w, b$ by setting partial derivatives to zero:

$$\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

Step 2: Substitute back to get the dual objective.

Dual problem:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \geq 0$$
KKT Conditions
Since the primal problem is convex and satisfies Slater's condition, strong duality holds. The optimal solution $(w^*, b^*, \alpha^*)$ satisfies the KKT conditions:

- Primal feasibility: $y_i(w^{*T} x_i + b^*) \geq 1$
- Dual feasibility: $\alpha_i^* \geq 0$
- Complementary slackness: $\alpha_i^* \left[ y_i(w^{*T} x_i + b^*) - 1 \right] = 0$
- Lagrangian stationarity: $w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i$, $\sum_{i=1}^{n} \alpha_i^* y_i = 0$

Key insight from complementary slackness:

- If $\alpha_i^* > 0$, then $y_i(w^{*T} x_i + b^*) = 1$: the sample lies on the margin boundary and is a support vector.
- If $y_i(w^{*T} x_i + b^*) > 1$, then $\alpha_i^* = 0$: the sample does not contribute to $w^*$.

Therefore, the decision function depends only on support vectors:

$$f(x) = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right)$$
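The KKT conditions can be verified numerically. For a symmetric two-point toy problem (a made-up example: $x_1 = (1,1)$ with $y_1 = +1$ and $x_2 = (-1,-1)$ with $y_2 = -1$), the optimum can be worked out by hand as $\alpha_1^* = \alpha_2^* = 0.25$, and every KKT condition checks out:

```python
import numpy as np

# Toy separable dataset (hypothetical, for illustration)
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Hand-derived optimum for this symmetric 2-point problem:
# both points are support vectors with alpha_i = 0.25
alpha = np.array([0.25, 0.25])

# Stationarity: w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X          # -> [0.5, 0.5]
b = y[0] - w @ X[0]          # from y_1 (w.x_1 + b) = 1 with y_1 = +1 -> 0.0

# KKT checks
margins = y * (X @ w + b)                # primal feasibility: all >= 1
slack = alpha * (margins - 1.0)          # complementary slackness: all 0

print(w, b)                      # [0.5 0.5] 0.0
print(margins)                   # [1. 1.] -> both points on the margin boundary
print(np.allclose(slack, 0.0))   # True
print(alpha @ y)                 # 0.0, dual equality constraint holds
```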
Soft Margin Support Vector Machines
Slack Variables and Penalty Parameter
Real data is often not linearly separable. Soft margin SVM allows some samples to violate the margin constraints by introducing slack variables $\xi_i \geq 0$.

Optimization problem:

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

where $C > 0$ is the penalty parameter that trades off margin width against margin violations.
The following figure shows how the C parameter affects the decision boundary:
Dual Problem
The soft margin dual problem:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C$$

The only difference from the hard margin dual is the added upper bound constraint $\alpha_i \leq C$ (the "box constraint").
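As a sanity check (a sketch with made-up numbers, not from the original post), the soft margin primal can be evaluated in its equivalent hinge-loss form, since at the optimum $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Primal soft margin objective, with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * hinge.sum()

# Illustrative data: the third point sits on the wrong side of the hyperplane
X = np.array([[2.0, 2.0], [-2.0, -2.0], [0.5, 0.5]])
y = np.array([1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), 0.0

# margins y*(Xw + b) = [2, 2, -0.5] -> hinge = [0, 0, 1.5]
print(soft_margin_objective(w, b, X, y, C=1.0))   # 0.25 + 1.5 = 1.75
```

Only the violating point contributes a slack term; increasing $C$ would penalize it more heavily.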
The following figure illustrates the difference between hard margin and soft margin SVM:
Kernel Methods and Nonlinear SVM
Feature Mapping and Kernel Functions
For linearly inseparable data, map the input to a high-dimensional feature space where it becomes separable.

The kernel trick visualizes this transformation:

Formally, map input space $\mathcal{X}$ to a high-dimensional feature space $\mathcal{H}$ via $\phi: \mathcal{X} \to \mathcal{H}$, and perform linear classification in feature space:

$$f(x) = \operatorname{sign}\left( w^T \phi(x) + b \right)$$

The dual problem becomes:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \phi(x_i)^T \phi(x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C$$

The kernel function is defined as:

$$K(x, z) = \phi(x)^T \phi(z)$$

Kernel Trick: There is no need to explicitly compute $\phi(x)$; the dual problem and the decision function involve only inner products, which $K$ evaluates directly in the original input space.
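The kernel trick can be checked numerically with a small sketch: for the homogeneous degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ on $\mathbb{R}^2$, the explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the two computations agree without ever forming $\phi$:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel on R^2."""
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed directly in input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

print(phi(x) @ phi(z))      # 9 + 12 + 4 = 25.0
print(poly2_kernel(x, z))   # (3 + 2)^2 = 25.0
```

The saving grows with dimension: for inputs in $\mathbb{R}^d$, $\phi$ lives in $O(d^2)$ dimensions, while the kernel costs only $O(d)$.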
Common Kernel Functions
1. Linear Kernel: $K(x, z) = x^T z$

2. Polynomial Kernel: $K(x, z) = (x^T z + c)^d$

where $c \geq 0$ and $d$ is the polynomial degree.

3. Gaussian Kernel (RBF Kernel): $K(x, z) = \exp\left( -\gamma \|x - z\|^2 \right)$

where $\gamma > 0$ controls the kernel width.

The following figure compares different kernel functions on the same dataset:

4. Sigmoid Kernel: $K(x, z) = \tanh\left( \beta \, x^T z + \theta \right)$
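For concreteness, here are the four kernels evaluated on a pair of illustrative vectors (the hyperparameter values $c = 1$, $d = 2$, $\gamma = 0.5$, $\beta = 0.1$, $\theta = -1$ are arbitrary choices for this sketch):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

linear  = x @ z                                    # 3 + 8 = 11.0
poly    = (x @ z + 1.0) ** 2                       # (11 + 1)^2 = 144.0
rbf     = np.exp(-0.5 * np.sum((x - z) ** 2))      # ||x - z||^2 = 8 -> exp(-4)
sigmoid = np.tanh(0.1 * (x @ z) - 1.0)             # tanh(0.1)

print(linear, poly)          # 11.0 144.0
print(round(rbf, 4))         # ~0.0183
print(round(sigmoid, 4))     # ~0.0997
```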
Mercer's Theorem and Positive Definite Kernels
Mercer's Theorem: A symmetric function $K(x, z)$ is a valid kernel (i.e., corresponds to an inner product in some feature space) if and only if its kernel matrix is positive semi-definite,

i.e., for any finite set of samples $x_1, \dots, x_m$, the Gram matrix $\left[ K(x_i, x_j) \right]_{m \times m}$ is positive semi-definite.
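The Mercer condition can be checked empirically on any sample set by examining the eigenvalues of the Gram matrix (a sketch with random data and the RBF kernel, $\gamma = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 random samples in R^3 (illustrative)

# RBF Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2) with gamma = 1
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

# Mercer check: all eigenvalues of the symmetric Gram matrix are >= 0
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True: K is (numerically) PSD
```

A passing check on one sample set is of course not a proof; Mercer's theorem requires positive semi-definiteness for every finite sample set.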
SMO Algorithm
Coordinate Ascent and Block Coordinate Descent
Solving the dual problem directly is a QP in $n$ variables; generic QP solvers need $O(n^2)$ memory for the kernel matrix and up to $O(n^3)$ time, which is prohibitive for large training sets.

Coordinate ascent: Optimize one variable at a time while fixing the others. But due to the equality constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$, changing a single $\alpha_i$ alone would violate feasibility.

SMO Algorithm (Sequential Minimal Optimization, Platt 1998): Each iteration selects two variables $\alpha_i, \alpha_j$, fixes the rest, and solves the resulting two-variable subproblem analytically.
The animation below demonstrates the convergence process of the SMO algorithm:
SMO Two-Variable Subproblem
Suppose we select $\alpha_1, \alpha_2$ and fix the remaining multipliers.

Elimination: From the equality constraint, $\alpha_1 y_1 + \alpha_2 y_2 = \zeta$ (a constant determined by the fixed variables), so $\alpha_1 = y_1(\zeta - \alpha_2 y_2)$.

Substituting into the objective function yields a univariate quadratic in $\alpha_2$.

Unconstrained optimum:

$$\alpha_2^{\text{new,unc}} = \alpha_2^{\text{old}} + \frac{y_2 (E_1 - E_2)}{\eta}$$

where:

- $E_i = f(x_i) - y_i$ is the prediction error on sample $i$
- $\eta = K_{11} + K_{22} - 2K_{12}$

Clipping: Ensure $\alpha_2 \in [L, H]$:

$$\alpha_2^{\text{new}} = \operatorname{clip}\left( \alpha_2^{\text{new,unc}},\ L,\ H \right)$$

where, for $y_1 \neq y_2$: $L = \max(0,\ \alpha_2^{\text{old}} - \alpha_1^{\text{old}})$, $H = \min(C,\ C + \alpha_2^{\text{old}} - \alpha_1^{\text{old}})$; for $y_1 = y_2$: $L = \max(0,\ \alpha_1^{\text{old}} + \alpha_2^{\text{old}} - C)$, $H = \min(C,\ \alpha_1^{\text{old}} + \alpha_2^{\text{old}})$.

Update $\alpha_1$ from the equality constraint, $\alpha_1^{\text{new}} = \alpha_1^{\text{old}} + y_1 y_2 (\alpha_2^{\text{old}} - \alpha_2^{\text{new}})$, then recompute $b$ from the KKT conditions on the updated pair.
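The update rules above can be assembled into a teaching sketch of *simplified* SMO (random choice of the second multiplier, as in common course materials, rather than Platt's full selection heuristics), run here on a made-up four-point dataset with a linear kernel:

```python
import numpy as np

def smo_simplified(X, y, C=10.0, tol=1e-4, max_passes=10, seed=0):
    """Simplified SMO sketch: linear kernel, random second-multiplier choice."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                      # linear kernel Gram matrix
    alpha, b = np.zeros(n), 0.0
    passes = 0
    while passes < max_passes:       # stop after max_passes clean sweeps
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            # pick i only if it violates KKT beyond tolerance
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i, a_j = alpha[i], alpha[j]
                if y[i] != y[j]:
                    L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
                if L == H or eta <= 0:
                    continue
                alpha[j] = np.clip(a_j + y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-7:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                # recompute b from the KKT conditions on the updated pair
                b1 = b - E_i - y[i]*(alpha[i]-a_i)*K[i,i] - y[j]*(alpha[j]-a_j)*K[i,j]
                b2 = b - E_j - y[i]*(alpha[i]-a_i)*K[i,j] - y[j]*(alpha[j]-a_j)*K[j,j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X
    return w, b, alpha

# Tiny separable dataset (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = smo_simplified(X, y)
print(np.sign(X @ w + b))   # matches y: all four points classified correctly
```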
Q&A Highlights
Q1: Why does SVM maximize margin?
A: Geometrically, a larger margin makes the classifier more robust to noise and to small perturbations of the inputs. Statistically, margin-based generalization bounds (Vapnik) show that the test error is controlled by the margin rather than by the input dimension, so a larger margin yields a tighter bound on generalization error.
Q2: What are the advantages of dual problem over primal?
A: 1) The dual depends only on inner products $x_i^T x_j$, which enables the kernel trick; 2) the solution is sparse, since only support vectors have $\alpha_i > 0$; 3) the number of dual variables is $n$ regardless of the feature dimension, which helps when $d \gg n$.
Q3: Why do only support vectors affect the decision boundary?
A: By KKT complementary slackness, non-support vectors satisfy $\alpha_i = 0$, so they drop out of $w = \sum_i \alpha_i y_i x_i$ and of the decision function. Removing them from the training set does not change the solution.
Q4: How to choose the soft margin parameter $C$?
A: Through cross-validation. A larger $C$ penalizes margin violations more heavily, shrinking the margin and risking overfitting; a smaller $C$ allows more violations, widening the margin and risking underfitting.
Q5: How does the RBF kernel parameter $\gamma$ affect the model?
A: $\gamma$ controls each sample's radius of influence. A large $\gamma$ gives a narrow influence range and a highly flexible decision boundary (overfitting risk); a small $\gamma$ gives a smooth, nearly linear boundary (underfitting risk). Tune $\gamma$ jointly with $C$.
Q6: Why is SVM sensitive to feature scaling?
A: SVM computes the margin via Euclidean distance ($\|w\|$ in the margin, $\|x - z\|$ in the RBF kernel), so features with large numeric ranges dominate the distance and distort the margin. Standardize features before training.
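A small sketch (made-up units) shows how one badly scaled feature dominates the Euclidean distance that SVM relies on:

```python
import numpy as np

# Two samples that differ equally (by 1 unit) in each feature
a = np.array([1.0, 1.0])
b = np.array([2.0, 2.0])

# Same data, but feature 1 recorded in grams instead of kilograms (x1000)
a_raw = np.array([1000.0, 1.0])
b_raw = np.array([2000.0, 2.0])

print(np.linalg.norm(b - a))          # ~1.414: both features contribute equally
print(np.linalg.norm(b_raw - a_raw))  # ~1000.0: feature 1 dominates entirely
```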
Q7: Can SVM handle multi-label classification?
A: Standard SVM is a binary classifier. For multi-label problems (each sample may belong to multiple classes), train one independent binary SVM per label (binary relevance): each classifier predicts the presence or absence of its label, and a sample receives every label whose classifier fires.
Q8: Why is SMO algorithm efficient?
A: SMO avoids solving the large-scale QP directly (which needs $O(n^2)$ memory for the kernel matrix and up to $O(n^3)$ time). Each two-variable subproblem has a closed-form solution, so no inner numerical solver is needed, and each update touches only two rows of the kernel matrix.
Q9: Difference between SVM and logistic regression?
A:

| Dimension | SVM | Logistic Regression |
|-----------|-----|---------------------|
| Loss | Hinge Loss | Cross-Entropy |
| Output | Decision value (distance) | Probability |
| Sparsity | Support vectors (sparse) | All samples participate |
| Kernel trick | Natural support | Requires modification |
| Multi-class | OvR/OvO | Softmax |
SVM suits hard classification and nonlinear boundaries; logistic regression suits probability prediction and large-scale data.
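The loss row of the comparison can be made concrete: hinge loss is exactly zero for samples beyond the margin (which is where SVM's sparsity comes from), while cross-entropy is positive everywhere, so every sample contributes to the logistic regression gradient. A sketch over a few margin values $m = y \cdot f(x)$:

```python
import numpy as np

m = np.array([-2.0, 0.0, 1.0, 2.0])     # margins y * f(x)

hinge = np.maximum(0.0, 1.0 - m)        # SVM: exactly 0 once m >= 1
logistic = np.log1p(np.exp(-m))         # logistic regression loss (natural log)

print(hinge)                  # [3. 1. 0. 0.]
print(np.round(logistic, 3))  # [2.127 0.693 0.313 0.127]
```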
Q10: How does SVM handle multi-class classification?
A: Two main strategies. One-vs-Rest (OvR) trains $K$ classifiers, each separating one class from all the rest, and predicts the class with the largest decision value. One-vs-One (OvO) trains $K(K-1)/2$ pairwise classifiers and predicts by majority vote; each problem is smaller, and LIBSVM uses OvO by default.
✏️ Exercises and Solutions
Exercise 1: Geometric Margin Calculation
Problem: Given a hyperplane $w^T x + b = 0$ and a labeled sample $(x_0, y_0)$, compute the geometric margin.

Solution:

The geometric margin is $\gamma_0 = \frac{y_0(w^T x_0 + b)}{\|w\|}$: evaluate $w^T x_0 + b$, multiply by the label $y_0$, and divide by $\|w\| = \sqrt{w_1^2 + \cdots + w_d^2}$. A positive result means the sample is correctly classified and lies at distance $\gamma_0$ from the hyperplane.
Exercise 2: Support Vector Identification
Problem: In hard margin SVM, given the optimal solution $(w^*, b^*, \alpha^*)$, identify which training samples are support vectors.

Solution:

Check complementary slackness: $\alpha_i^* \left[ y_i(w^{*T} x_i + b^*) - 1 \right] = 0$ must hold for every sample.

- If $\alpha_i^* = 0$ and $y_i(w^{*T} x_i + b^*) > 1$, the sample lies strictly outside the margin and is not a support vector.
- If $\alpha_i^* > 0$, complementary slackness forces $y_i(w^{*T} x_i + b^*) = 1$: the sample lies exactly on the margin boundary.

Correct answer: The support vectors are exactly the samples with $\alpha_i^* > 0$, equivalently those satisfying $y_i(w^{*T} x_i + b^*) = 1$.
Exercise 3: Kernel Function Validity
Problem: Prove that if $K_1$ and $K_2$ are valid kernels, then $K = K_1 + K_2$ is also a valid kernel.

Solution:

Valid kernels correspond to positive semi-definite (PSD) kernel matrices (Mercer's theorem). For any samples $x_1, \dots, x_m$, the Gram matrix of $K$ is $G = G_1 + G_2$, where $G_1, G_2$ are the Gram matrices of $K_1, K_2$. For any $c \in \mathbb{R}^m$:

$$c^T G c = c^T G_1 c + c^T G_2 c \geq 0$$

Since $G_1$ and $G_2$ are PSD, both terms are non-negative, so $G$ is PSD and $K$ is a valid kernel.
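The closure property can be spot-checked numerically (a sketch with random data; the linear and RBF kernels stand in for $K_1$ and $K_2$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))   # 10 random samples (illustrative)

# Gram matrices of two valid kernels on the same samples
G1 = X @ X.T                                          # linear kernel
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
G2 = np.exp(-sq)                                      # RBF kernel, gamma = 1

G = G1 + G2
print(np.linalg.eigvalsh(G1).min() >= -1e-10)  # True
print(np.linalg.eigvalsh(G2).min() >= -1e-10)  # True
print(np.linalg.eigvalsh(G).min() >= -1e-10)   # True: the sum is PSD too
```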
Exercise 4: Soft Margin and C Parameter
Problem: In soft margin SVM with 100 training samples, how does the number of support vectors change as $C$ varies?

Solution:

Large $C$: heavy penalty on the slack variables $\xi_i$ forces the model to fit the training data tightly, approaching hard margin behavior. Fewer samples become support vectors (only those truly on the margin boundary).

Small $C$: more margin violations are allowed, producing a larger margin but more samples inside or violating the margin. More support vectors result, because the box constraint $0 \leq \alpha_i \leq C$ lets many samples take non-zero $\alpha_i$.

In soft margin SVM:

- Samples with $\alpha_i = 0$: outside the margin, not support vectors.
- Samples with $0 < \alpha_i < C$: exactly on the margin boundary ($\xi_i = 0$).
- Samples with $\alpha_i = C$: inside the margin or misclassified ($\xi_i > 0$).

Smaller $C$ therefore typically yields more support vectors and a smoother decision boundary.
Exercise 5: RBF Kernel Parameter Selection
Problem: A dataset has 200 samples and 10 features. Using an RBF kernel, a large value of $\gamma$ gives 100% training accuracy and 70% test accuracy; a small value of $\gamma$ gives 95% training accuracy and 60% test accuracy. Diagnose both settings.

Solution:

Large $\gamma$: 100% training accuracy with a 30% train-test gap indicates overfitting. An RBF kernel with large $\gamma$ gives each sample a narrow influence range, creating a highly complex decision boundary that memorizes the training data, so test accuracy drops sharply.

Small $\gamma$: 95% training and 60% test accuracy. The decision boundary is too smooth to capture the data structure, and the remaining 35% gap shows the model still generalizes poorly.

Recommendation:

- Neither setting is optimal; $\gamma$ must be tuned between the two extremes, jointly with $C$.

Best practice: use grid search with cross-validation over logarithmically spaced values of $C$ and $\gamma$, and select the pair with the best validation accuracy.
References
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
- Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
- Platt, J. C. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research.
- Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. MIT Press.
- Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 12: Support Vector Machines]
Support vector machines, with solid theoretical foundation, elegant mathematical form, and excellent practical performance, are a milestone algorithm in machine learning. From margin maximization to kernel tricks, from duality theory to SMO algorithm, SVM perfectly demonstrates the unity of theory and practice. Understanding SVM is not only essential for mastering classical algorithms, but also the path to deep learning (many modern loss functions borrow from Hinge Loss) and advanced kernel method topics.
- Post title: Machine Learning Mathematical Derivations (8): Support Vector Machines
- Post author: Chen Kai
- Create time: 2021-10-06 14:45:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-8-Support-Vector-Machines/
- Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA unless stated otherwise.