Machine Learning Mathematical Derivations (18): Clustering Algorithms
Chen Kai

Clustering is a core task in unsupervised learning — automatically discovering group structures based on data similarity without labels. From the elegant simplicity of K-means to the density-adaptive DBSCAN, from hierarchical clustering's tree structure to spectral clustering's graph-theoretic foundation, clustering algorithms provide powerful tools for exploratory data analysis, customer segmentation, image segmentation, and anomaly detection. This chapter deeply derives Lloyd's algorithm and EM interpretation of K-means, linkage criteria in hierarchical clustering, density reachability in DBSCAN, and Laplacian matrix with NCut objective in spectral clustering.

Formalizing the Clustering Problem

Clustering Objective

Input: Dataset $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$

Output: Cluster assignments $\mathbf{c} = \{c_1, \dots, c_N\}$, where $c_i \in \{1, \dots, K\}$ is the cluster label of point $i$.

Principles:

  1. High cohesion: Points within the same cluster have high similarity
  2. Low coupling: Points in different clusters have low similarity

Similarity Measures

Euclidean distance:

$$d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|_2 = \sqrt{\sum_{j=1}^d (x_j - x'_j)^2}$$

Manhattan distance:

$$d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|_1 = \sum_{j=1}^d |x_j - x'_j|$$

Cosine similarity:

$$\cos(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^\top \mathbf{x}'}{\|\mathbf{x}\| \, \|\mathbf{x}'\|}$$

Suitable for text and sparse data.
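As a quick illustration, the three measures can be written directly in numpy (the function names below are my own; the post itself does not prescribe an implementation):

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: sqrt(sum_j (x_j - y_j)^2)
    return np.linalg.norm(x - y)

def manhattan(x, y):
    # L1 distance: sum_j |x_j - y_j|
    return np.abs(x - y).sum()

def cosine_similarity(x, y):
    # <x, y> / (||x|| ||y||); depends only on direction,
    # which is why it suits sparse text vectors
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([3.0, 4.0]), np.array([3.0, 0.0])
```

For these two vectors, both distances equal 4 and the cosine similarity is $9 / (5 \cdot 3) = 0.6$.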

Clustering Quality Evaluation

Internal metrics (no labels needed):

  1. Silhouette Coefficient:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $a(i)$ is the average distance from point $i$ to the other points in its own cluster, and $b(i)$ is the average distance from $i$ to the points of the nearest other cluster.

Range $[-1, 1]$; higher is better.

  2. Calinski-Harabasz Index:

$$\text{CH} = \frac{\operatorname{tr}(\mathbf{B}_K) / (K - 1)}{\operatorname{tr}(\mathbf{W}_K) / (N - K)}$$

where $\mathbf{B}_K$ is the between-cluster scatter matrix and $\mathbf{W}_K$ is the within-cluster scatter matrix; higher is better.

External metrics (require ground truth labels):

  1. Adjusted Rand Index (ARI): Range $[-1, 1]$, higher is better
  2. Normalized Mutual Information (NMI): Range $[0, 1]$, higher is better
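The silhouette coefficient is easy to compute from pairwise distances; here is a minimal numpy sketch (the helper name `silhouette_scores` and the toy data are my own):

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distance matrix
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():
            continue                          # singleton cluster: s(i) = 0 by convention
        a = D[i, same].mean()                 # mean distance within own cluster
        b = min(D[i, labels == k].mean()      # mean distance to nearest other cluster
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated clusters should score close to 1
X = np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]])
scores = silhouette_scores(X, np.array([0, 0, 1, 1]))
```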

K-means: Centroid-Driven Clustering

Lloyd's Algorithm

*(Figures: K-means clustering illustration and convergence process)*

Objective: Minimize the Within-Cluster Sum of Squares (WCSS)

$$J = \sum_{k=1}^K \sum_{i: c_i = k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

where $\boldsymbol{\mu}_k$ is the centroid of cluster $k$.

Optimization problem:

$$\min_{\{\boldsymbol{\mu}_k\}, \{c_i\}} \sum_{k=1}^K \sum_{i: c_i = k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

Difficulty: Joint optimization over the discrete variables $\{c_i\}$ and continuous variables $\{\boldsymbol{\mu}_k\}$ is NP-hard.

Solution: Coordinate descent (Lloyd's algorithm)

Lloyd's Algorithm Derivation

Step 1 (Assignment): Fix $\{\boldsymbol{\mu}_k\}$, optimize $\{c_i\}$:

$$c_i = \arg\min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

Physical meaning: Assign each point to nearest centroid

Step 2 (Update): Fix $\{c_i\}$, optimize $\{\boldsymbol{\mu}_k\}$:

$$\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{i \in C_k} \mathbf{x}_i$$

where $C_k = \{i: c_i = k\}$ is the point set of cluster $k$.

Physical meaning: Centroid is mean of all cluster points
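The two alternating steps can be sketched in numpy as follows (a simplified implementation with plain random initialization; the function name and convergence check are my own choices):

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # simple random initialization: K distinct data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 1 (assignment): each point goes to its nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2 (update): each centroid becomes the mean of its points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = lloyd_kmeans(X, K=2, seed=0)
```

On two well-separated blobs like these, the alternation converges to the obvious two-cluster partition in a few iterations.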

K-means++ Initialization

Problem: Random initialization may lead to poor local optima

K-means++ strategy:

  1. Randomly select the first centroid $\boldsymbol{\mu}_1$ uniformly from the data
  2. For $k = 2, \dots, K$:
    • Compute the squared minimum distance $D(\mathbf{x}_i)^2 = \min_{j < k} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$ from each point to the already-selected centroids
    • Select the next centroid $\boldsymbol{\mu}_k = \mathbf{x}_i$ with probability $\frac{D(\mathbf{x}_i)^2}{\sum_{i'} D(\mathbf{x}_{i'})^2}$

Intuition: Points far from existing centroids are more likely to be selected, so the initial centroids end up well dispersed.

Theoretical guarantee: K-means++ achieves expected objective $\mathbb{E}[J] \leq 8(\ln K + 2) \, J^*$, where $J^*$ is the optimal objective (Arthur & Vassilvitskii, 2007).
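The seeding strategy can be sketched directly from the $D(\mathbf{x})^2$ weighting above (the name `kmeanspp_init` is my own):

```python
import numpy as np

def kmeanspp_init(X, K, rng=None):
    """K-means++ seeding: sample each new centroid with prob proportional to D(x)^2."""
    rng = np.random.default_rng(rng)
    centroids = [X[rng.integers(len(X))]]       # step 1: first centroid uniformly
    for _ in range(1, K):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                    # step 2: sample proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Three coincident points plus a far blob: the second seed must leave the first blob
X = np.array([[0.0, 0], [0, 0], [0, 0], [10, 10], [10, 11], [11, 10]])
C = kmeanspp_init(X, K=2, rng=0)
```

Note that a point already chosen has $D(\mathbf{x})^2 = 0$ and therefore zero probability of being chosen again, so the seeds are always distinct.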

K-means Limitations

1. Requires specifying $K$ in advance

Solutions:

  • Elbow Method: Plot the $J$ vs $K$ curve and look for the elbow
  • Silhouette coefficient: Choose the $K$ maximizing the average silhouette
  • BIC/AIC: Information criteria

2. Assumes convex, isotropic clusters

Counterexamples: Ring-shaped clusters, elliptical clusters

Solutions: Kernel K-means, spectral clustering

3. Sensitive to outliers

Solution: K-medoids (use an actual data point, the medoid, as the cluster center instead of the mean, which is more robust to outliers)

Hierarchical Clustering: From Trees to Clusters

Agglomerative Hierarchical Clustering

*(Figures: hierarchical clustering and dendrogram)*

Idea: Bottom-up cluster merging, building dendrogram

Algorithm flow:

Initialization: Each point is its own cluster, $C_i = \{\mathbf{x}_i\}$, giving $N$ clusters.

Iteration:

  1. Find the two closest clusters $C_i$ and $C_j$
  2. Merge them: $C_i \leftarrow C_i \cup C_j$, remove $C_j$
  3. Update the distance matrix
  4. Repeat until $K$ clusters remain (or 1)

Key: How to define inter-cluster distance?

Linkage Criteria

1. Single Linkage: Minimum distance

$$d(C_i, C_j) = \min_{\mathbf{x} \in C_i,\, \mathbf{x}' \in C_j} \|\mathbf{x} - \mathbf{x}'\|$$

Characteristics: Tends to produce chain-shaped clusters; sensitive to outliers

2. Complete Linkage: Maximum distance

$$d(C_i, C_j) = \max_{\mathbf{x} \in C_i,\, \mathbf{x}' \in C_j} \|\mathbf{x} - \mathbf{x}'\|$$

Characteristics: Tends to produce compact, spherical clusters; robust to outliers

3. Average Linkage: Average distance

$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{x}' \in C_j} \|\mathbf{x} - \mathbf{x}'\|$$

4. Ward Linkage: Minimize the increase in within-cluster variance caused by merging

$$\Delta(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2$$

Characteristics: Consistent with the K-means objective; tends to produce similarly-sized clusters
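A naive $O(N^3)$ numpy sketch of agglomerative clustering with a selectable linkage criterion (Ward is omitted for brevity; the function name and toy data are my own):

```python
import numpy as np

def agglomerative(X, K, linkage="single"):
    """Bottom-up merging of the closest cluster pair until K clusters remain."""
    clusters = [[i] for i in range(len(X))]          # start: each point is a cluster
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)

    def dist(A, B):
        d = D[np.ix_(A, B)]                          # pairwise distances between clusters
        if linkage == "single":
            return d.min()                           # minimum pairwise distance
        if linkage == "complete":
            return d.max()                           # maximum pairwise distance
        return d.mean()                              # average linkage

    while len(clusters) > K:
        # find and merge the closest pair of clusters
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels_single = agglomerative(X, 2, linkage="single")
```

On clearly separated blobs all three criteria agree; their behavior differs mainly on elongated or noisy data, as described above.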

DBSCAN: Density-Driven Clustering

Core Concepts

Neighborhood: The $\epsilon$-neighborhood of point $\mathbf{x}$ is

$$N_\epsilon(\mathbf{x}) = \{\mathbf{x}' \in \mathbf{X}: \|\mathbf{x} - \mathbf{x}'\| \leq \epsilon\}$$

Core point: Its $\epsilon$-neighborhood contains at least $\text{MinPts}$ points

Border point: Not a core point, but in some core point's neighborhood

Noise point: Neither core point nor border point

Density Reachability

Directly density reachable: $\mathbf{q}$ is in the $\epsilon$-neighborhood of $\mathbf{p}$, and $\mathbf{p}$ is a core point.

Density reachable: There exists a chain $\mathbf{p} = \mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_n = \mathbf{q}$ such that each $\mathbf{p}_{i+1}$ is directly density reachable from $\mathbf{p}_i$.

Density connected: There exists a point $\mathbf{o}$ such that both $\mathbf{p}$ and $\mathbf{q}$ are density reachable from $\mathbf{o}$.

Cluster definition:

  • Maximality: If $\mathbf{p}$ is a core point, all points density reachable from $\mathbf{p}$ are in the same cluster
  • Connectivity: Any two points within a cluster are density connected
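These definitions translate almost directly into code; a minimal DBSCAN sketch in numpy (noise is labeled $-1$; the function name and toy data are my own):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: labels >= 0 are clusters, -1 marks noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]    # eps-neighborhoods
    core = np.array([len(nb) >= min_pts for nb in neighbors])   # core-point test
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from core point i via density reachability
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:                  # only core points expand the cluster
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

# Two dense 4-point blobs plus one isolated outlier
X = np.array([[0.0, 0], [0, 0.3], [0.3, 0], [0.3, 0.3],
              [10, 10], [10, 10.3], [10.3, 10], [10.3, 10.3],
              [5, 5]])
labels = dbscan(X, eps=0.5, min_pts=3)
```

Here each blob point is a core point (4 neighbors including itself), so two clusters are grown, while the isolated point stays noise.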

DBSCAN Advantages and Limitations

Advantages:

  1. No need to specify cluster count
  2. Can discover arbitrarily-shaped clusters
  3. Automatically identifies noise points
  4. Robust to outliers

Limitations:

  1. Parameter sensitive: The choice of $\epsilon$ and $\text{MinPts}$ significantly affects results
  2. Difficult with varying density clusters
    • Solution: HDBSCAN (hierarchical DBSCAN)
  3. High-dimensional data challenges: Curse of dimensionality causes distance concentration

Spectral Clustering: Graph-Theoretic Perspective

Graph Construction

View data as a graph:

  • Nodes: Data points
  • Edge weights: Similarity $W_{ij}$

Similarity matrix (Gaussian kernel):

$$W_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$$

Degree matrix:

$$D_{ii} = \sum_{j=1}^N W_{ij}, \qquad \mathbf{D} = \operatorname{diag}(D_{11}, \dots, D_{NN})$$

Graph Laplacian Matrix

Unnormalized Laplacian: $\mathbf{L} = \mathbf{D} - \mathbf{W}$

Properties:

  1. Symmetric positive semi-definite: $\mathbf{f}^\top \mathbf{L} \mathbf{f} = \frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2 \geq 0$
  2. Smallest eigenvalue is 0, with the all-ones vector as its eigenvector
  3. The second smallest eigenvalue (Fiedler value) measures graph connectivity

Normalized Laplacian (symmetric version): $\mathbf{L}_{\text{sym}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$

Spectral Clustering Optimization Objective

NCut (Normalized Cut):

$$\text{NCut}(A_1, \dots, A_K) = \sum_{k=1}^K \frac{\text{cut}(A_k, \bar{A}_k)}{\text{vol}(A_k)}$$

where $\text{cut}(A, \bar{A}) = \sum_{i \in A,\, j \in \bar{A}} W_{ij}$ and $\text{vol}(A) = \sum_{i \in A} D_{ii}$.

Objective: Minimize inter-cluster connections while balancing cluster volumes.

Spectral Clustering Algorithm

Input: Data $\mathbf{X}$, cluster count $K$, similarity parameter $\sigma$

Steps:

  1. Build the similarity matrix $\mathbf{W}$

  2. Compute the degree matrix $\mathbf{D}$

  3. Compute the Laplacian matrix $\mathbf{L} = \mathbf{D} - \mathbf{W}$ or $\mathbf{L}_{\text{sym}}$

  4. Eigendecomposition: Find the eigenvectors of the $K$ smallest eigenvalues

  5. Build the embedding matrix $\mathbf{U} \in \mathbb{R}^{N \times K}$ whose columns are these eigenvectors

  6. Normalize (for $\mathbf{L}_{\text{sym}}$): Normalize each row of $\mathbf{U}$ to unit length

  7. K-means clustering: Run K-means on the row vectors of $\mathbf{U}$

  8. Assign cluster labels: If K-means assigns row $i$ to cluster $k$, then $c_i = k$

Complexity: $O(N^2 d)$ (similarity matrix) + $O(N^3)$ (eigendecomposition)
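For $K = 2$ the algorithm reduces to thresholding the Fiedler vector at zero, a common simplification of the steps above (this sketch uses the unnormalized Laplacian and my own function name; the full algorithm runs K-means on $K$ eigenvectors):

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral cut: split by the sign of the Fiedler vector of L = D - W."""
    # Steps 1-2: Gaussian similarity matrix and degree matrix
    sq = ((X[:, None] - X[None, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # Step 3: unnormalized Laplacian
    # Step 4: eigh returns eigenvalues of a symmetric matrix in ascending order
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                       # eigenvector of the 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)           # Steps 5-8: sign gives the 2-way partition

# Two blobs connected only by weak edges: the Fiedler vector separates them
X = np.array([[0.0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]])
labels = spectral_bipartition(X, sigma=2.0)
```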

Spectral Clustering Advantages and Limitations

Advantages:

  1. Can discover non-convex clusters
  2. Solid theoretical foundation from graph theory
  3. Only needs similarity matrix, not original features

Limitations:

  1. High complexity: $O(N^3)$ eigendecomposition, difficult for large-scale data
    • Solution: Nyström approximation, random sampling
  2. Parameter sensitive: Choice of $\sigma$ significantly affects results
  3. Still requires specifying the cluster count $K$

Q&A

Q1: Why does K-means assume spherical clusters?

A: K-means uses Euclidean distance, assigning points to nearest centroid. This implicitly assumes each cluster's covariance matrix is the identity matrix (isotropic). If clusters are elliptical or irregular, K-means will fail. Solutions: GMM (allows different covariances) or spectral clustering.

Q2: How to determine optimal K?

A: Common methods:

  1. Elbow method: Plot the WCSS vs $K$ curve and find the elbow
  2. Silhouette coefficient: Choose the $K$ maximizing the average silhouette
  3. Gap statistic: Compare WCSS with that of reference random data
  4. Domain knowledge: Consider the practical problem

In practice, usually combine multiple methods and consider interpretability.

Q3: How does DBSCAN handle varying density clusters?

A: DBSCAN struggles when cluster densities differ greatly, because a single $\epsilon$ cannot adapt to all clusters. Solutions:

  • HDBSCAN: Hierarchical DBSCAN, adapts to different densities
  • OPTICS: Generates a reachability plot, allowing multi-scale analysis
  • Multiple runs: Run DBSCAN separately on regions of different density

Q4: Why does spectral clustering need a final K-means step?

A: The eigenvectors $\mathbf{U}$ from the eigendecomposition are continuous; they must be discretized into cluster labels. K-means works well in the low-dimensional embedding space because:

  • The spectral embedding already projects the data into an easily separable space
  • Clusters of embedding vectors correspond to partitions of the original graph

Theoretically, discretization is a necessary step in relaxed optimization.

Q5: GMM vs K-means, when to choose GMM?

A: Choose GMM when:

  • Clusters are elliptical (different variance directions)
  • Soft assignment is needed (probability of belonging to multiple clusters)
  • A probabilistic interpretation is needed (generative model)

Choose K-means when:

  • Clusters are approximately spherical
  • Speed matters (K-means is much faster)
  • Data is large (GMM covariance estimation needs more samples)

✏️ Exercises and Solutions

Exercise 1: K-means Objective

Problem: Prove that the K-means centroid update step decreases the objective $J$.

Solution: For fixed assignments, set the derivative to zero: $\frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2 \sum_{i \in C_k} (\mathbf{x}_i - \boldsymbol{\mu}_k) = 0$ gives $\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{i \in C_k} \mathbf{x}_i$ (the cluster mean). Since $J$ is convex in $\boldsymbol{\mu}_k$, this is the minimum, so the update cannot increase $J$.


Exercise 2: DBSCAN Parameters

Problem: $\epsilon = 0.5$, $\text{MinPts} = 3$. Point $\mathbf{p}$ has 2 neighbors within radius 0.5. Is $\mathbf{p}$ a core point?

Solution: $\mathbf{p}$'s $\epsilon$-neighborhood contains 3 points (including $\mathbf{p}$ itself), which equals $\text{MinPts}$, so yes, it is a core point.


Exercise 3: GMM vs K-means

Problem: Compare GMM and K-means.

Solution: K-means: hard assignment, spherical clusters, Lloyd algorithm. GMM: soft assignment (probabilities), elliptical clusters (covariance), EM algorithm. GMM is probabilistic generalization of K-means.


Exercise 4: Hierarchical Linkage

Problem: Performance of single-linkage vs complete-linkage on noise.

Solution: Single-linkage is sensitive to noise and exhibits the "chaining effect". Complete-linkage is robust to noise and produces compact clusters. Ward linkage (minimizing the variance increase) is often a robust default in practice.


Exercise 5: Silhouette Coefficient

Problem: Point $i$ has average intra-cluster distance $a(i)$ and average nearest-cluster distance $b(i)$, with $a(i) < b(i)$. Compute $s(i)$.

Solution: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} = 1 - \frac{a(i)}{b(i)} > 0$: the point is closer to its own cluster than to the nearest other cluster, indicating good clustering. An average silhouette closer to 1 means better clustering overall.

References

[1] Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.

[2] Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. SODA, 1027-1035.

[3] Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 226-231.

[4] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. NIPS, 849-856.

[5] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416.

  • Post title: Machine Learning Mathematical Derivations (18): Clustering Algorithms
  • Post author: Chen Kai
  • Create time: 2021-12-05 14:15:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-18-Clustering-Algorithms/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.