Clustering is a core task in unsupervised learning: automatically discovering group structure from data similarity, without labels. From the elegant simplicity of K-means to the density-adaptive DBSCAN, from hierarchical clustering's tree structure to spectral clustering's graph-theoretic foundation, clustering algorithms provide powerful tools for exploratory data analysis, customer segmentation, image segmentation, and anomaly detection. This chapter derives in detail Lloyd's algorithm and the EM interpretation of K-means, the linkage criteria of hierarchical clustering, density reachability in DBSCAN, and the Laplacian matrix and NCut objective of spectral clustering.
Formalizing the Clustering Problem
Clustering Objective
Input: Dataset $X = \{x_1, x_2, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
Output: Cluster assignment $C = \{C_1, C_2, \dots, C_K\}$ satisfying:
- High cohesion: Points within the same cluster have high similarity
- Low coupling: Points in different clusters have low similarity
Similarity Measures
Euclidean distance: $d(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}$
Manhattan distance: $d(x_i, x_j) = \|x_i - x_j\|_1 = \sum_{l=1}^{d} |x_{il} - x_{jl}|$
Cosine similarity: $s(x_i, x_j) = \dfrac{x_i^\top x_j}{\|x_i\|\,\|x_j\|}$, suitable for text and sparse data
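The three measures above can be sketched in a few lines of NumPy (the function names here are my own, chosen for illustration):

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)          # ||x - y||_2

def manhattan(x, y):
    return np.abs(x - y).sum()            # ||x - y||_1

def cosine_similarity(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(euclidean(x, y))          # sqrt(2) ≈ 1.414
print(manhattan(x, y))          # 2.0
print(cosine_similarity(x, y))  # 0.0 (orthogonal vectors)
```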
Clustering Quality Evaluation
Internal metrics (no labels needed):
- Silhouette Coefficient: $s(i) = \dfrac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, where $a(i)$ is the mean distance from $x_i$ to the other points in its own cluster and $b(i)$ is the smallest mean distance from $x_i$ to the points of any other cluster. Range $[-1, 1]$; closer to 1 means better-separated clusters.
- Calinski-Harabasz Index: $CH = \dfrac{\operatorname{tr}(B_K)/(K-1)}{\operatorname{tr}(W_K)/(n-K)}$, where $B_K$ and $W_K$ are the between-cluster and within-cluster scatter matrices. Higher is better.

External metrics (require ground-truth labels):
- Adjusted Rand Index (ARI): Range $[-1, 1]$, higher is better
- Normalized Mutual Information (NMI): Range $[0, 1]$, higher is better
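As a sketch of the silhouette definition above, the following NumPy function (a minimal illustration, not a production implementation) computes the mean silhouette over a toy dataset:

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette s(i) = (b - a) / max(a, b) over all points."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        if same.sum() == 1:                 # singleton cluster: s(i) = 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)       # mean intra-cluster distance
        b = min(D[i, labels == c].mean()              # nearest other cluster
                for c in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated pairs -> silhouette close to 1
X = np.array([[0.0, 0], [0, 1], [10, 0], [10, 1]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels))   # ≈ 0.90
```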
K-means: Centroid-Driven Clustering
Lloyd's Algorithm
Objective: Minimize the Within-Cluster Sum of Squares (WCSS)
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
where $\mu_k$ is the centroid of cluster $C_k$.
Optimization problem:
$$\min_{C_1, \dots, C_K,\ \mu_1, \dots, \mu_K} J$$
Difficulty: Joint optimization over the discrete assignment variables and continuous centroids is NP-hard in general.
Solution: Coordinate descent (Lloyd's algorithm), alternately optimizing assignments and centroids.
Lloyd's Algorithm Derivation
Step 1 (Assignment): Fix the centroids $\mu_1, \dots, \mu_K$ and optimize the assignments:
$$c_i = \arg\min_{k} \|x_i - \mu_k\|^2$$
Physical meaning: Assign each point to its nearest centroid.
Step 2 (Update): Fix the assignments and optimize the centroids. Setting $\frac{\partial J}{\partial \mu_k} = -2\sum_{x_i \in C_k} (x_i - \mu_k) = 0$ gives
$$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
where $|C_k|$ is the size of cluster $C_k$.
Physical meaning: The centroid is the mean of all points in the cluster.
Each step can only decrease (or keep) $J$, so Lloyd's algorithm converges to a local optimum.
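The two alternating steps can be sketched directly in NumPy (a minimal illustration; the function name and random initialization are my own choices):

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]       # random initial centroids
    for _ in range(n_iter):
        # Step 1 (assignment): nearest centroid for every point
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # Step 2 (update): centroid = mean of its assigned points
        new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                    # converged
            break
        mu = new_mu
    return c, mu

# Two well-separated blobs: each blob ends up in its own cluster.
X = np.array([[0.0, 0], [0, 1], [9, 9], [9, 10]])
c, mu = lloyd_kmeans(X, K=2)
print(c)
```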
K-means++ Initialization
Problem: Random initialization may lead to poor local optima
K-means++ strategy:
- Randomly select the first centroid $\mu_1$ uniformly from the data
- For $k = 2, \dots, K$:
  - Compute the squared minimum distance $D(x_i)^2 = \min_{1 \le j < k} \|x_i - \mu_j\|^2$ from each point $x_i$ to the already-selected centroids
  - Select $x_i$ as the next centroid $\mu_k$ with probability $\dfrac{D(x_i)^2}{\sum_{j} D(x_j)^2}$

Intuition: Points far from the existing centroids are more likely to be selected, so the initial centroids are well dispersed.
Theoretical guarantee: K-means++ achieves expected objective $\mathbb{E}[J] \le 8(\ln K + 2)\, J_{\mathrm{OPT}}$ [2].
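The seeding rule above can be sketched as follows (an illustrative NumPy sketch; the function name is mine):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ seeding: pick each new centroid with probability ∝ D(x)^2."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]              # first centroid: uniform
    for _ in range(1, K):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in centroids], axis=0)
        p = d2 / d2.sum()                              # sampling distribution
        centroids.append(X[rng.choice(len(X), p=p)])
    return np.array(centroids)

X = np.array([[0.0, 0], [0.1, 0], [10, 10], [10.1, 10]])
C = kmeans_pp_init(X, K=2)
print(C)   # the two centroids tend to come from different blobs
```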
K-means Limitations
1. Requires specifying $K$ in advance
Solutions: Elbow method (plot WCSS against $K$ and look for the "elbow"), silhouette analysis
2. Assumes convex, isotropic clusters
Counterexamples: Ring-shaped clusters, elongated elliptical clusters
Solutions: Kernel K-means, spectral clustering
3. Sensitive to outliers (the mean is easily pulled by extreme values)
Solution: K-medoids (use an actual data point, the medoid, as the cluster center instead of the mean)
Hierarchical Clustering: From Trees to Clusters
Agglomerative Hierarchical Clustering
Idea: Bottom-up cluster merging, building dendrogram
Algorithm flow:
Initialization: Each point is its own cluster, $C_i = \{x_i\}$, $i = 1, \dots, n$
Iteration:
1. Find the two closest clusters $C_i$ and $C_j$
2. Merge them: $C_i \leftarrow C_i \cup C_j$, remove $C_j$
3. Update the distance matrix
4. Repeat until $K$ clusters remain (or only 1)
Key: How to define inter-cluster distance
Linkage Criteria
1. Single Linkage: Minimum distance
$$d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$$
Characteristics: Tends to produce chain-shaped clusters ("chaining effect"); sensitive to outliers
2. Complete Linkage: Maximum distance
$$d(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$$
Characteristics: Tends to produce compact, spherical clusters; robust to outliers
3. Average Linkage: Average distance
$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
4. Ward Linkage: Merge the pair whose merging minimizes the increase in within-cluster variance
$$\Delta(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2$$
Characteristics: Consistent with the K-means objective; tends to produce similarly-sized clusters
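The four criteria can be compared numerically on two small example clusters (an illustrative sketch; the cluster data are made up for demonstration):

```python
import numpy as np

A = np.array([[0.0, 0], [0, 1]])                      # cluster A
B = np.array([[4.0, 0], [4, 1]])                      # cluster B
D = np.linalg.norm(A[:, None] - B[None, :], axis=2)   # pairwise cross-distances

single   = D.min()     # minimum pairwise distance
complete = D.max()     # maximum pairwise distance
average  = D.mean()    # mean pairwise distance
# Ward: increase in WCSS when merging = |A||B|/(|A|+|B|) * ||mu_A - mu_B||^2
mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)
ward = len(A) * len(B) / (len(A) + len(B)) * np.sum((mu_A - mu_B) ** 2)

print(single, complete, average, ward)   # 4.0, sqrt(17) ≈ 4.123, ≈ 4.062, 16.0
```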
DBSCAN: Density-Driven Clustering
Core Concepts
$\epsilon$-Neighborhood: $N_\epsilon(x) = \{y \in X : d(x, y) \le \epsilon\}$
Core point: Neighborhood contains at least $\mathrm{MinPts}$ points, i.e. $|N_\epsilon(x)| \ge \mathrm{MinPts}$
Border point: Not a core point, but in some core point's neighborhood
Noise point: Neither core point nor border point
Density Reachability
Directly density reachable: $y$ is directly density reachable from $x$ if $y \in N_\epsilon(x)$ and $x$ is a core point
Density reachable: There exists a chain $x = p_1, p_2, \dots, p_m = y$ such that each $p_{t+1}$ is directly density reachable from $p_t$
DBSCAN Advantages and Limitations
Advantages:
- No need to specify cluster count
- Can discover arbitrarily-shaped clusters
- Automatically identifies noise points
- Robust to outliers
Limitations:
- Parameter sensitive: The choice of $\epsilon$ and $\mathrm{MinPts}$ significantly affects results
- Difficult with varying-density clusters
  - Solution: HDBSCAN (hierarchical DBSCAN)
- High-dimensional data challenges: Curse of dimensionality causes distance concentration
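Putting the definitions above together, a minimal DBSCAN can be sketched in NumPy (an illustrative simplification: noise and "unvisited" share the label -1, and border points join the first cluster that reaches them):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: labels >= 0 are clusters, -1 marks noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]  # eps-neighborhoods
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from core point i via density reachability
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster          # border or core point joins cluster
                if core[j]:                  # only core points keep expanding
                    stack.extend(neighbors[j])
        cluster += 1
    return labels

# Two dense blobs plus one isolated noise point
X = np.array([[0.0, 0], [0, 0.5], [0.5, 0],
              [5.0, 5], [5, 5.5], [5.5, 5], [20.0, 20]])
print(dbscan(X, eps=1.0, min_pts=3))   # → [0 0 0 1 1 1 -1]
```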
Spectral Clustering: Graph-Theoretic Perspective
Graph Construction
View the data as a weighted graph:
- Nodes: Data points $x_1, \dots, x_n$
- Edge weights: Similarity between points
Similarity matrix: $W_{ij} = \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right)$ (Gaussian kernel)
Degree matrix: $D = \operatorname{diag}(d_1, \dots, d_n)$, where $d_i = \sum_{j=1}^{n} W_{ij}$
Graph Laplacian Matrix
Unnormalized Laplacian: $L = D - W$
Properties:
1. Symmetric positive semi-definite: $f^\top L f = \frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2 \ge 0$ for any $f \in \mathbb{R}^n$
2. Smallest eigenvalue is 0, with the all-ones vector $\mathbf{1}$ as eigenvector
3. Second smallest eigenvalue (Fiedler value) measures graph connectivity
Normalized Laplacian (symmetric version): $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
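These Laplacian properties can be checked numerically on a small graph: two triangles joined by one weak edge (a toy graph chosen for illustration):

```python
import numpy as np

W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1                 # weak bridge between the triangles

D = np.diag(W.sum(axis=1))              # degree matrix
L = D - W                               # unnormalized Laplacian
eigvals, eigvecs = np.linalg.eigh(L)    # ascending eigenvalues

print(np.allclose(L, L.T))              # True: symmetric
print(eigvals.min() >= -1e-10)          # True: positive semi-definite
print(abs(eigvals[0]) < 1e-10)          # True: smallest eigenvalue is 0
print(eigvals[1])                       # small Fiedler value: weakly connected graph
```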
Spectral Clustering Optimization Objective
NCut (Normalized Cut):
$$\mathrm{NCut}(A_1, \dots, A_K) = \sum_{k=1}^{K} \frac{\mathrm{cut}(A_k, \bar{A}_k)}{\mathrm{vol}(A_k)}$$
where $\mathrm{cut}(A, \bar{A}) = \sum_{i \in A,\, j \in \bar{A}} W_{ij}$ and $\mathrm{vol}(A) = \sum_{i \in A} d_i$. Relaxing the discrete cluster indicators to real values turns minimizing NCut into an eigenvalue problem on $L_{\mathrm{sym}}$.
Spectral Clustering Algorithm
Input: Data $X$, cluster count $K$
1. Build the similarity matrix $W$
2. Compute the degree matrix $D$
3. Compute the Laplacian matrix $L = D - W$ or $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$
4. Eigendecomposition: Find the eigenvectors of the $K$ smallest eigenvalues
5. Build the embedding matrix $U \in \mathbb{R}^{n \times K}$ with these eigenvectors as columns
6. Normalize (for $L_{\mathrm{sym}}$): Normalize each row of $U$ to unit length
7. K-means clustering: Run K-means on the row vectors of $U$
8. Assign cluster labels: If K-means assigns row $i$ to cluster $k$, then $x_i \in C_k$

Complexity: $O(n^2 d)$ (similarity matrix) + $O(n^3)$ (eigendecomposition)
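The embedding part of this algorithm (up to the final K-means) can be sketched in NumPy; on two well-separated blobs, rows from the same blob land on (nearly) the same point of the embedding, so any clustering of the rows recovers the blobs (an illustrative sketch; function name and toy data are mine):

```python
import numpy as np

def spectral_embed(X, K, sigma=1.0):
    """Gaussian similarity -> L_sym -> row-normalized K-dim spectral embedding."""
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))                 # similarity matrix
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L_sym)                    # ascending eigenvalues
    U = vecs[:, :K].copy()                             # K smallest eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True)      # row-normalize
    return U                                           # final step: K-means on rows

X = np.array([[0.0, 0], [0, 0.3], [0.3, 0],
              [8.0, 8], [8, 8.3], [8.3, 8]])
U = spectral_embed(X, K=2)
print(np.round(U, 3))   # rows 0-2 coincide; rows 3-5 coincide, nearly orthogonal
```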
Spectral Clustering Advantages and Limitations
Advantages:
- Can discover non-convex clusters
- Solid theoretical foundation from graph theory
- Only needs similarity matrix, not original features
Limitations:
- High complexity: $O(n^3)$, difficult for large-scale data
  - Solution: Nyström approximation, random sampling
- Parameter sensitive: The choice of $\sigma$ in the similarity kernel significantly affects results
- Still requires specifying the cluster count $K$
Q&A
Q1: Why does K-means assume spherical clusters?
A: K-means uses Euclidean distance, assigning points to nearest centroid. This implicitly assumes each cluster's covariance matrix is the identity matrix (isotropic). If clusters are elliptical or irregular, K-means will fail. Solutions: GMM (allows different covariances) or spectral clustering.
Q2: How to determine optimal K?
A: Common methods: 1. Elbow method: Plot WCSS vs K curve, find elbow 2. Silhouette coefficient: Choose K maximizing average silhouette 3. Gap statistic: Compare WCSS with random data's WCSS 4. Domain knowledge: Consider practical problem
In practice, usually combine multiple methods and consider interpretability.
Q3: How does DBSCAN handle varying density clusters?
A: DBSCAN struggles with large density differences because a single $(\epsilon, \mathrm{MinPts})$ setting cannot fit all clusters: an $\epsilon$ small enough for the dense clusters labels points of sparse clusters as noise, while an $\epsilon$ large enough for the sparse clusters merges the dense ones. Solutions: OPTICS (orders points by reachability distance instead of fixing one $\epsilon$) or HDBSCAN (builds a hierarchy over density levels and extracts stable clusters).
Q4: Why does spectral clustering need a final K-means step?
A: The Laplacian eigenvectors are a continuous relaxation of the discrete cluster-indicator vectors: minimizing NCut exactly is NP-hard, so the discreteness constraint is dropped and the relaxed solution is real-valued. The rows of the embedding matrix $U$ place the points in $\mathbb{R}^K$, where well-separated clusters map to well-separated directions, and a final K-means step discretizes this real-valued embedding back into hard cluster assignments.
Theoretically, discretization is a necessary step of the relaxed optimization.
Q5: GMM vs K-means, when to choose GMM?
A: Choose GMM when:
- Clusters are elliptical (different variance directions)
- Soft assignment is needed (probability of belonging to each cluster)
- A probabilistic interpretation is needed (generative model)

Choose K-means when:
- Clusters are approximately spherical
- Speed matters (K-means is much faster)
- Data is large (GMM covariance estimation needs more samples)
✏️ Exercises and Solutions
Exercise 1: K-means Objective
Problem: Prove that the K-means centroid update step does not increase $J$.
Solution: For fixed assignments, $J(\mu_k) = \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$ is convex in $\mu_k$. Setting the derivative to zero:
$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{x_i \in C_k} (x_i - \mu_k) = 0 \;\Rightarrow\; \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
So the mean is the unique minimizer, and the update step can only decrease (or keep) $J$. Since the assignment step also never increases $J$, Lloyd's algorithm converges.
Exercise 2: DBSCAN Parameters
Problem: How can suitable values of $\epsilon$ and $\mathrm{MinPts}$ be chosen for DBSCAN?
Solution: A common heuristic is the k-distance plot: set $k = \mathrm{MinPts}$ (often $\mathrm{MinPts} \ge d + 1$ for $d$-dimensional data), compute each point's distance to its $k$-th nearest neighbor, plot these distances in sorted order, and set $\epsilon$ at the "elbow" of the curve; points with larger k-distances are likely noise.
Exercise 3: GMM vs K-means
Problem: Compare GMM and K-means.
Solution: K-means: hard assignment, spherical clusters, Lloyd's algorithm. GMM: soft assignment (probabilities), elliptical clusters (covariance), EM algorithm. GMM is a probabilistic generalization of K-means.
Exercise 4: Hierarchical Linkage
Problem: Performance of single-linkage vs complete-linkage on noise.
Solution: Single-linkage: sensitive to noise, prone to the "chaining effect". Complete-linkage: robust to noise, produces compact clusters. Ward linkage (minimize variance increase) often offers the best overall performance in practice.
Exercise 5: Silhouette Coefficient
Problem: Given a point $i$ with mean intra-cluster distance $a(i)$ and mean distance $b(i)$ to the nearest other cluster, compute its silhouette coefficient.
Solution: $s(i) = \dfrac{b(i) - a(i)}{\max\{a(i), b(i)\}}$. For example (illustrative values), $a(i) = 2$ and $b(i) = 5$ give $s(i) = \frac{5 - 2}{5} = 0.6$, a point well matched to its own cluster; $a(i) = b(i)$ gives $s(i) = 0$, a point on the boundary between two clusters.
References
[1] Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.
[2] Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. SODA, 1027-1035.
[3] Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 226-231.
[4] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. NIPS, 849-856.
[5] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416.
- Post title: Machine Learning Mathematical Derivations (18): Clustering Algorithms
- Post author: Chen Kai
- Create time: 2021-12-05 14:15:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-18-Clustering-Algorithms/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.