Machine Learning Mathematical Derivations (8): Support Vector Machines
Chen Kai

Support Vector Machine (SVM) is one of the most elegant algorithms in modern machine learning — it perfectly combines geometric intuition, convex optimization theory, and kernel methods, achieving efficient classification by finding the maximum margin hyperplane. From linear separability to nonlinear mapping, from hard margin to soft margin, from primal form to dual problem, SVM's mathematical derivations showcase the depth and beauty of machine learning theory. This chapter systematically derives the complete theoretical framework of SVM, including Lagrangian duality, KKT conditions, SMO algorithm, kernel function construction, and theoretical guarantees.

Linear Separable Support Vector Machines

Geometric Intuition: Maximum Margin Classifier

Consider binary classification with training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. The linear classifier is defined as:

$$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$

The decision function is $\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$.

Functional Margin: The functional margin of sample $(\mathbf{x}_i, y_i)$ is:

$$\hat{\gamma}_i = y_i(\mathbf{w}^\top \mathbf{x}_i + b)$$

A positive functional margin indicates correct classification; a larger value indicates higher confidence.

Geometric Margin: The geometric distance of sample $(\mathbf{x}_i, y_i)$ to the hyperplane is:

$$\gamma_i = \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

Derivation: For the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$, the distance from a point $\mathbf{x}_i$ to the hyperplane is:

$$d_i = \frac{|\mathbf{w}^\top \mathbf{x}_i + b|}{\|\mathbf{w}\|}$$

This is the standard point-to-hyperplane formula from linear algebra. Multiplying by the label $y_i$ ensures the distance is positive when the sample is classified correctly.
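As a quick numeric check, the geometric margin formula can be evaluated directly with NumPy. The hyperplane and points below are illustrative values, not taken from the text:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and two labeled points (toy values).
w = np.array([2.0, 1.0])
b = -3.0
X = np.array([[3.0, 2.0],
              [1.0, 0.0]])
y = np.array([1, -1])

# Geometric margin: gamma_i = y_i * (w . x_i + b) / ||w||
gamma = y * (X @ w + b) / np.linalg.norm(w)
print(gamma)  # both entries positive -> both points correctly classified
```

Note that scaling $(\mathbf{w}, b)$ by a constant leaves `gamma` unchanged, which is what makes the normalization $\hat{\gamma} = 1$ in the next subsection possible.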

Maximum Margin Optimization Problem

Hard margin SVM finds the separating hyperplane with maximum geometric margin:

$$\max_{\mathbf{w}, b} \ \gamma \quad \text{s.t.} \quad \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\|\mathbf{w}\|} \ge \gamma, \quad i = 1, \dots, N$$

Equivalently, fix the functional margin $\hat{\gamma} = 1$ (achievable by rescaling $\mathbf{w}, b$), transforming to:

$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, N$$

This is a convex Quadratic Programming (QP) problem.

Support Vectors: Samples satisfying $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ are called Support Vectors (SVs); they lie on the margin boundary and determine the hyperplane's position. Other samples ($y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 1$) don't affect the solution.

Dual Problem Derivation

Introduce non-negative Lagrange multipliers $\alpha_i \ge 0$ and form the Lagrangian:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^N \alpha_i \left[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right]$$

Primal problem: $\min_{\mathbf{w}, b} \max_{\boldsymbol{\alpha} \ge 0} L(\mathbf{w}, b, \boldsymbol{\alpha})$

Dual problem: $\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}, b} L(\mathbf{w}, b, \boldsymbol{\alpha})$

Step 1: Minimize over $\mathbf{w}, b$. Setting the derivatives to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^N \alpha_i y_i = 0$$

Step 2: Substitute back to obtain the dual objective.

Dual problem:

$$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \quad \alpha_i \ge 0$$
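For intuition, the dual can be solved by hand in the smallest possible case: one point per class, placed symmetrically. The equality constraint then forces $\alpha_1 = \alpha_2$, and the dual objective has a closed-form maximizer. The data below is a toy example of my own, not from the text:

```python
import numpy as np

# Toy separable data: one point per class (illustrative values).
x1, y1 = np.array([1.0, 0.0]), 1
x2, y2 = np.array([-1.0, 0.0]), -1

# With two points, sum(alpha_i y_i) = 0 forces alpha1 = alpha2 = a.
# Dual objective: 2a - (1/2) a^2 ||x1 - x2||^2, maximized at a* = 2 / ||x1 - x2||^2.
a = 2.0 / np.sum((x1 - x2) ** 2)

# Recover the primal solution: w = sum alpha_i y_i x_i; b from y1 (w . x1 + b) = 1.
w = a * y1 * x1 + a * y2 * x2
b = y1 - w @ x1
print(a, w, b)  # 0.5 [1. 0.] 0.0
```

Both margin constraints hold with equality here, so both points are support vectors, as expected for a two-point problem.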

KKT Conditions

Since the primal problem is convex and satisfies Slater's condition, strong duality holds. The optimal solution $(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)$ satisfies the KKT conditions:

  1. Primal feasibility: $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) \ge 1$
  2. Dual feasibility: $\alpha_i^* \ge 0$
  3. Complementary slackness: $\alpha_i^* \left[ y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) - 1 \right] = 0$
  4. Lagrangian stationarity: $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$, $\sum_i \alpha_i^* y_i = 0$

Key insight from complementary slackness:

  • If $\alpha_i^* > 0$, then $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) = 1$ ($\mathbf{x}_i$ is a support vector)
  • If $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) > 1$, then $\alpha_i^* = 0$ (non-support vector)

Therefore, the decision function depends only on support vectors:

$$f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i^* y_i \, \mathbf{x}_i^\top \mathbf{x} + b^* \right)$$
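A small sketch of this sparsity: with hypothetical multipliers (illustrative values, not a solved model), the zero-$\alpha$ sample drops out of the prediction entirely:

```python
import numpy as np

# Three training points with hypothetical multipliers; only the first two
# (alpha > 0) are support vectors and contribute to the decision function.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
y = np.array([1, -1, 1])
alpha = np.array([0.25, 0.25, 0.0])  # illustrative, not from a solver
b = 0.0

def decide(x):
    # f(x) = sign( sum_i alpha_i y_i <x_i, x> + b ) -- zero-alpha terms vanish
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

print(decide(np.array([2.0, 0.5])))  # 1.0
```

Only the inner products with support vectors matter at prediction time, which is exactly what makes kernelized prediction affordable later.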

Soft Margin Support Vector Machines

Slack Variables and Penalty Parameter

Real data is often not linearly separable. Soft margin SVM allows some samples to violate the margin constraints by introducing slack variables $\xi_i \ge 0$:

  • $\xi_i = 0$: sample outside the margin, correctly classified
  • $0 < \xi_i \le 1$: sample inside the margin but correctly classified
  • $\xi_i > 1$: sample misclassified

Optimization problem:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

where $C > 0$ is the penalty parameter:

  • Large $C$: emphasizes empirical risk, approaches hard margin (may overfit)
  • Small $C$: emphasizes structural risk, larger margin but allows more errors (may underfit)
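At the optimum the slack takes the hinge form $\xi_i = \max(0,\ 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b))$, so the soft-margin objective can be evaluated directly for any candidate hyperplane. The values below are toy illustrations:

```python
import numpy as np

# Evaluate (1/2)||w||^2 + C * sum(xi) for a given hyperplane (toy values).
w, b, C = np.array([1.0, 0.0]), 0.0, 1.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1])

# Hinge slack: zero for samples with functional margin >= 1.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
objective = 0.5 * w @ w + C * xi.sum()
print(xi, objective)  # [0.  0.5 0. ] 1.0
```

Here the second sample sits inside the margin ($0 < \xi_2 \le 1$), so it is penalized, while the other two satisfy the hard constraint and contribute no slack.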

The following figure shows how the C parameter affects the decision boundary:

[Figure: C Parameter Effect]

Dual Problem

The soft margin dual problem:

$$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$

The only difference from the hard margin dual is the added upper bound constraint $\alpha_i \le C$.

The following figure illustrates the difference between hard margin and soft margin SVM:

[Figure: Margin Illustration]

Kernel Methods and Nonlinear SVM

Feature Mapping and Kernel Functions

For linearly inseparable data, the idea is to map it into a higher-dimensional space where it becomes separable.

The kernel trick visualizes this transformation:

[Figure: Kernel Transformation]

For nonlinear problems, map the input space to a high-dimensional feature space via $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{H}$:

$$\mathbf{x} \mapsto \boldsymbol{\phi}(\mathbf{x})$$

Perform linear classification in the feature space:

$$f(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b$$

The dual problem becomes:

$$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j)$$

The kernel function is defined as:

$$K(\mathbf{x}, \mathbf{z}) = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{z})$$

Kernel Trick: There is no need to explicitly compute $\boldsymbol{\phi}$; computing the kernel $K(\mathbf{x}, \mathbf{z})$ suffices, avoiding computation in high-dimensional or infinite-dimensional feature spaces.
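The trick can be verified concretely for the quadratic kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$ in two dimensions, whose explicit feature map is the standard $\boldsymbol{\phi}(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

# Explicit feature map for K(x, z) = (x . z)^2 in 2-D.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2       # kernel evaluation: O(d) work in input space
rhs = phi(x) @ phi(z)    # inner product in the 3-D feature space
print(lhs, rhs)  # 1.0 1.0
```

The two quantities agree exactly, yet the kernel evaluation never materializes the feature vector; for the RBF kernel that vector would be infinite-dimensional, so this saving is essential.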

Common Kernel Functions

1. Linear Kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$

2. Polynomial Kernel: $K(\mathbf{x}, \mathbf{z}) = (\gamma \mathbf{x}^\top \mathbf{z} + r)^d$

where $\gamma, r, d$ are hyperparameters.

3. Gaussian Kernel (RBF Kernel): $K(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right)$

where $\gamma = \frac{1}{2\sigma^2}$. The RBF kernel corresponds to an infinite-dimensional feature space.

The following figure compares different kernel functions on the same dataset:

[Figure: Kernel Comparison]

4. Sigmoid Kernel: $K(\mathbf{x}, \mathbf{z}) = \tanh(\gamma \mathbf{x}^\top \mathbf{z} + r)$ (Note: the Sigmoid kernel is positive definite only for certain parameters)
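A minimal sketch of the four kernels above as plain functions (hyperparameter defaults are my own illustrative choices):

```python
import numpy as np

# Minimal kernel sketches; gamma, r, d are hyperparameters (toy defaults).
def linear(x, z):
    return x @ z

def poly(x, z, gamma=1.0, r=1.0, d=2):
    return (gamma * (x @ z) + r) ** d

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, gamma=1.0, r=0.0):
    return np.tanh(gamma * (x @ z) + r)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(linear(x, z), poly(x, z), rbf(x, z))  # 0.0 1.0 exp(-1)
```

For orthogonal inputs the linear kernel vanishes while the RBF kernel still reports a similarity that decays with squared distance, which is why RBF handles nonlinear structure that the linear kernel cannot.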

Mercer's Theorem and Positive Definite Kernels

Mercer's Theorem: A symmetric function $K(\mathbf{x}, \mathbf{z})$ is a valid kernel function (i.e., there exists a feature map $\boldsymbol{\phi}$ such that $K(\mathbf{x}, \mathbf{z}) = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{z})$) if and only if, for any $N$ samples $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, the kernel matrix $K$ with $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite:

$$\mathbf{c}^\top K \mathbf{c} = \sum_{i,j} c_i c_j K(\mathbf{x}_i, \mathbf{x}_j) \ge 0 \quad \text{for any } \mathbf{c} \in \mathbb{R}^N$$
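The positive semi-definiteness condition is easy to probe empirically: build a Gram matrix on random samples and inspect its eigenvalues. The sample count and $\gamma$ below are arbitrary illustrative choices:

```python
import numpy as np

# Empirical Mercer check: the RBF Gram matrix on random samples should be
# positive semi-definite (all eigenvalues >= 0 up to numerical round-off).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)  # RBF kernel matrix with gamma = 0.5

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-8)  # True
```

This is a necessary-condition check only: a finite Gram matrix being PSD on one sample set does not prove validity, but a single negative eigenvalue would disprove it.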

SMO Algorithm

Coordinate Ascent and Block Coordinate Descent

Solving the dual problem requires satisfying the constraints $0 \le \alpha_i \le C$ and $\sum_{i=1}^N \alpha_i y_i = 0$.

Coordinate ascent: Optimize one variable at a time while fixing the others. But due to the equality constraint $\sum_i \alpha_i y_i = 0$, changing one $\alpha_i$ alone violates the constraint.

SMO Algorithm (Sequential Minimal Optimization, Platt 1998): Select two variables to optimize at each step while fixing the others.

The animation below demonstrates the convergence process of the SMO algorithm:

[Figure: SMO Convergence]

SMO Two-Variable Subproblem

Suppose we select $\alpha_1, \alpha_2$ and fix $\alpha_3, \dots, \alpha_N$. The equality constraint simplifies to:

$$\alpha_1 y_1 + \alpha_2 y_2 = \zeta \quad (\text{constant})$$

Elimination: From the constraint, $\alpha_1 = y_1(\zeta - \alpha_2 y_2)$ (using $y_1^2 = 1$).

Substituting into the objective function yields a univariate quadratic in $\alpha_2$.

Unconstrained optimum:

$$\alpha_2^{\text{new,unc}} = \alpha_2^{\text{old}} + \frac{y_2(E_1 - E_2)}{\eta}$$

where:

  • $E_i = f(\mathbf{x}_i) - y_i$ is the prediction error
  • $\eta = K_{11} + K_{22} - 2K_{12}$

Clipping: Ensure $\alpha_2^{\text{new}}$ satisfies the box constraint $0 \le \alpha_2 \le C$:

$$\alpha_2^{\text{new}} = \operatorname{clip}\left(\alpha_2^{\text{new,unc}}, L, H\right)$$

where the bounds $L, H$ are computed differently for $y_1 = y_2$ vs. $y_1 \ne y_2$.

Update:

$$\alpha_1^{\text{new}} = \alpha_1^{\text{old}} + y_1 y_2 \left(\alpha_2^{\text{old}} - \alpha_2^{\text{new}}\right)$$
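One full SMO pair update can be sketched directly from these formulas. The two-point data and starting multipliers below are illustrative, not a converged model; the example only shows the mechanics of one step with a linear kernel in the $y_1 \ne y_2$ case:

```python
import numpy as np

# One SMO update for the pair (alpha1, alpha2), linear kernel, toy data.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
a_old = np.array([0.3, 0.3])
C, b = 1.0, 0.0

K = X @ X.T
f = (a_old * y) @ K + b               # current outputs f(x_i)
E = f - y                             # prediction errors E_i
eta = K[0, 0] + K[1, 1] - 2 * K[0, 1]

a2_unc = a_old[1] + y[1] * (E[0] - E[1]) / eta
# y1 != y2 case: L = max(0, a2 - a1), H = min(C, C + a2 - a1)
L, H = max(0.0, a_old[1] - a_old[0]), min(C, C + a_old[1] - a_old[0])
a2 = np.clip(a2_unc, L, H)
a1 = a_old[0] + y[0] * y[1] * (a_old[1] - a2)

# The paired update preserves the equality constraint sum(alpha_i y_i).
print(a1 * y[0] + a2 * y[1], (a_old * y).sum())  # 0.0 0.0
```

The key design point: because the pair moves along the line $\alpha_1 y_1 + \alpha_2 y_2 = \zeta$, feasibility of the equality constraint is maintained automatically, and only the box bounds need explicit clipping.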

Q&A Highlights

Q1: Why does SVM maximize margin?

A: Geometrically, a larger margin means more robustness to noise. Statistically, the margin $\gamma$ is inversely related to the VC dimension of margin-based classifiers; a larger margin means lower model complexity and better generalization. This is a core reason for SVM's strong generalization among linear classifiers.


Q2: What are the advantages of dual problem over primal?

A: 1) The dual depends only on inner products $\mathbf{x}_i^\top \mathbf{x}_j$, enabling the kernel trick; 2) Constraints simplify from $N$ inequalities to $N$ box constraints plus 1 equality; 3) Sparsity of the dual optimum (most $\alpha_i^* = 0$) makes prediction efficient.


Q3: Why do only support vectors affect the decision boundary?

A: By KKT complementary slackness, non-support vectors satisfy $\alpha_i^* = 0$ with $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) > 1$, contributing nothing to the decision function. This is SVM's key advantage: prediction depends on only a few critical samples.


Q4: How to choose the soft margin parameter $C$?

A: Through cross-validation. A larger $C$ emphasizes training error (approaches hard margin, may overfit); a smaller $C$ emphasizes the margin (may underfit). Common grid search: $C \in \{10^{-3}, 10^{-2}, \dots, 10^3\}$.


Q5: How does the RBF kernel parameter $\gamma$ affect the model?

A: $\gamma$ controls the kernel width. A larger $\gamma$ means a narrower kernel and a smaller influence range per sample, giving a more complex model (may overfit); a smaller $\gamma$ means a wider influence range and a smoother model (may underfit).


Q6: Why is SVM sensitive to feature scaling?

A: SVM computes margins and kernels via Euclidean distances (e.g., $\|\mathbf{x} - \mathbf{z}\|$ in the RBF kernel); different feature scales distort each feature's contribution to those distances. Standardization (like Z-score) lets features compete fairly, preventing large-scale features from dominating the decision boundary.
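Z-score standardization is a one-liner in NumPy; the toy matrix below has a second feature three orders of magnitude larger than the first:

```python
import numpy as np

# Without standardization, the large-scale feature dominates the Euclidean
# distances that margins (and RBF kernels) are built on. Toy data.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

# Z-score: subtract per-feature mean, divide by per-feature std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.std(axis=0))  # [1. 1.] -- every feature now has unit variance
```

In practice the mean and std must be computed on the training set only and reused to transform the test set, otherwise information leaks across the split.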


Q7: Can SVM handle multi-label classification?

A: Standard SVM is a binary classifier. For multi-label classification (each sample may belong to multiple classes), train $K$ independent OvR SVMs, each determining whether a sample belongs to its class.


Q8: Why is SMO algorithm efficient?

A: SMO avoids solving a large-scale QP over all $N$ variables at once by decomposing it into two-variable subproblems that admit analytical solutions. Selecting the variable pair that most violates the KKT conditions accelerates convergence. In practice it is often tens of times faster than general QP solvers.


Q9: Difference between SVM and logistic regression?

A:

| Dimension | SVM | Logistic Regression |
|-----------|-----|---------------------|
| Loss | Hinge Loss | Cross-Entropy |
| Output | Decision value (distance) | Probability |
| Sparsity | Support vectors (sparse) | All samples participate |
| Kernel trick | Natural support | Requires modification |
| Multi-class | OvR/OvO | Softmax |

SVM suits hard classification and nonlinear boundaries; logistic regression suits probability prediction and large-scale data.


Q10: How does SVM handle multi-class classification?

A: Two main strategies: OvR (One-vs-Rest: train $K$ binary classifiers, one per class) and OvO (One-vs-One: train $K(K-1)/2$ pairwise classifiers and predict by voting).

[Figure: Multi-class Strategies]

✏️ Exercises and Solutions

Exercise 1: Geometric Margin Calculation

Problem: Given a hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ and a point $\mathbf{x}_0$ with label $y_0$, calculate the geometric margin.

Solution:

$$\gamma = \frac{y_0(\mathbf{w}^\top \mathbf{x}_0 + b)}{\|\mathbf{w}\|}$$

Substitute the given $\mathbf{w}$, $b$, $\mathbf{x}_0$, and $y_0$, then divide the functional margin by $\|\mathbf{w}\|$.


Exercise 2: Support Vector Identification

Problem: In a hard margin SVM, the optimal solution $\mathbf{w}^*, b^*$ is given, together with three samples $(\mathbf{x}_1, y_1)$, $(\mathbf{x}_2, y_2)$, $(\mathbf{x}_3, y_3)$. Identify the support vectors.

Solution:

Check complementary slackness: for each sample, compute the functional margin $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*)$.

  • If $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) = 1$, the sample lies exactly on the margin boundary and is a support vector.
  • If $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) > 1$, complementary slackness forces $\alpha_i^* = 0$, so the sample is not a support vector.

Sanity check: if $(\mathbf{w}^*, b^*)$ is truly optimal, at least one sample in each class must lie on the margin boundary; otherwise the margin could still be enlarged, contradicting optimality.


Exercise 3: Kernel Function Validity

Problem: Prove that if $K_1(\mathbf{x}, \mathbf{z})$ and $K_2(\mathbf{x}, \mathbf{z})$ are valid kernels, then $K(\mathbf{x}, \mathbf{z}) = K_1(\mathbf{x}, \mathbf{z}) + K_2(\mathbf{x}, \mathbf{z})$ is also a valid kernel.

Solution:

Valid kernels correspond to positive semi-definite (PSD) kernel matrices. For any samples $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and any vector $\mathbf{c} \in \mathbb{R}^N$:

$$\mathbf{c}^\top (K_1 + K_2) \mathbf{c} = \mathbf{c}^\top K_1 \mathbf{c} + \mathbf{c}^\top K_2 \mathbf{c} \ge 0$$

Since $K_1$ and $K_2$ are PSD, their sum is PSD. Therefore, $K_1 + K_2$ is a valid kernel.
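A numerical illustration of this proof, using a linear and an RBF kernel on random samples (sample count and $\gamma$ are arbitrary illustrative choices):

```python
import numpy as np

# Gram matrices of a linear and an RBF kernel on the same samples are PSD,
# and so is their sum, matching the proof above.
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

K1 = X @ X.T                                          # linear kernel
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K2 = np.exp(-sq)                                      # RBF kernel, gamma = 1

for K in (K1, K2, K1 + K2):
    assert np.linalg.eigvalsh(K).min() >= -1e-8       # PSD up to round-off
print("K1, K2 and K1 + K2 are all PSD")
```

The same sum-of-quadratic-forms argument extends to non-negative linear combinations $aK_1 + bK_2$ with $a, b \ge 0$.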


Exercise 4: Soft Margin and C Parameter

Problem: In a soft margin SVM with 100 training samples, a smaller $C$ yields 30 samples with $\alpha_i > 0$, while a larger $C$ yields 15 samples with $\alpha_i > 0$. Explain this phenomenon.

Solution:

  • Larger $C$: heavy penalty on the slack variables forces the model to fit the training data tightly, approaching the hard margin. Fewer samples become support vectors (only those truly on the margin boundary).

  • Smaller $C$: allows more margin violations; the margin is larger, but more samples sit inside it or violate it. More support vectors appear because the box constraint $0 \le \alpha_i \le C$ lets many samples take non-zero $\alpha_i$.

In soft margin SVM:

  • Samples with $0 < \alpha_i < C$: on the margin boundary ($\xi_i = 0$)
  • Samples with $\alpha_i = C$: inside or violating the margin ($\xi_i > 0$)

Smaller $C$ → more support vectors (more regularization, larger margin)


Exercise 5: RBF Kernel Parameter Selection

Problem: A dataset has 200 samples and 10 features. Using an RBF kernel, a smaller $\gamma$ achieves training accuracy 95% and test accuracy 60%; a larger $\gamma$ achieves training accuracy 100% and test accuracy 70%. Which is better? Why?

Solution:

  • Larger $\gamma$: training accuracy 100% indicates overfitting. An RBF kernel with a large $\gamma$ has a narrow influence range per sample, creating a highly complex decision boundary that memorizes the training data; test accuracy drops to 70%.

  • Smaller $\gamma$: training accuracy 95%, test accuracy 60%. The 35-point train–test gap shows this setting also generalizes poorly; it is not well tuned either.

Recommendation:

  • Neither setting is optimal; tune $\gamma$ (together with $C$) via cross-validation.
  • Ideally, training and test accuracy are both high with a small gap.
  • The larger $\gamma$ is relatively better here (test accuracy 70% > 60%), but it still overfits; intermediate values of $\gamma$ should be tried.

Best practice: grid search over $\gamma$ and $C$ jointly with cross-validation, and select the model by validation accuracy.

References

  1. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
  2. Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
  3. Platt, J. C. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research.
  4. Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. MIT Press.
  5. Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press.
  6. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 12: Support Vector Machines]

The support vector machine, with its solid theoretical foundation, elegant mathematical form, and excellent practical performance, is a milestone algorithm in machine learning. From margin maximization to kernel tricks, from duality theory to the SMO algorithm, SVM perfectly demonstrates the unity of theory and practice. Understanding SVM is not only essential for mastering classical algorithms, but also a path to deep learning (many modern loss functions borrow from Hinge Loss) and to advanced kernel-method topics.

  • Post title: Machine Learning Mathematical Derivations (8): Support Vector Machines
  • Post author: Chen Kai
  • Create time: 2021-10-06 14:45:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-8-Support-Vector-Machines/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.