Machine Learning Mathematical Derivations (9): Naive Bayes
Chen Kai BOSS

Naive Bayes is the simplest yet most elegant probabilistic classifier — based on Bayes' theorem and the conditional independence assumption, it decomposes complex joint probabilities into simple products of conditional probabilities, enabling efficient classification learning. Despite the "naive" assumption often not holding in reality, Naive Bayes shows remarkable effectiveness in text classification, spam filtering, and sentiment analysis. This chapter systematically derives the theoretical foundations, parameter estimation methods, smoothing techniques, and performance analysis of Naive Bayes.

Bayesian Decision Theory Foundations

Bayes' Theorem and Posterior Probability

Bayes' Theorem is the core formula of probability theory, describing how to update prior beliefs through observed data:

$$P(c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c)\, P(c)}{P(\mathbf{x})}$$

where:

  • $P(c \mid \mathbf{x})$: Posterior probability, the probability of class $c$ after observing $\mathbf{x}$
  • $P(\mathbf{x} \mid c)$: Class-conditional likelihood, the probability of observing $\mathbf{x}$ given class $c$
  • $P(c)$: Prior probability, the prior distribution of class $c$
  • $P(\mathbf{x})$: Evidence, the marginal probability of $\mathbf{x}$

Bayes Optimal Classifier

For classification tasks, Bayesian decision theory gives the optimal decision rule: select the class with maximum posterior probability:

$$h^*(\mathbf{x}) = \arg\max_{c} P(c \mid \mathbf{x})$$

Since the denominator $P(\mathbf{x})$ is the same for all classes, this is equivalent to:

$$h^*(\mathbf{x}) = \arg\max_{c} P(\mathbf{x} \mid c)\, P(c)$$

This is the Maximum A Posteriori (MAP) criterion.

Theoretical guarantee: the Bayes optimal classifier minimizes the classification error rate:

$$R(h^*) = 1 - \mathbb{E}_{\mathbf{x}}\left[\max_{c} P(c \mid \mathbf{x})\right]$$

This is the minimum error any classifier can achieve (the Bayes Error).
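As a quick sanity check, the Bayes error can be computed numerically for a simple one-dimensional problem; the two Gaussian class densities below are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Two equal-prior classes with 1D Gaussian likelihoods N(0,1) and N(2,1)
# (an illustrative choice). The Bayes error is the expected mass of the
# non-maximal class: the integral of min(0.5*p0(x), 0.5*p1(x)) over x.
x = np.linspace(-10.0, 12.0, 200_001)
dx = x[1] - x[0]
p0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
p1 = np.exp(-(x - 2)**2 / 2) / np.sqrt(2 * np.pi)

bayes_error = np.sum(np.minimum(0.5 * p0, 0.5 * p1)) * dx
print(bayes_error)  # ≈ 0.1587 — no classifier can beat this on this problem
```

The numerical value agrees with the closed form $\Phi(-1)$ for this pair of densities, since the optimal decision boundary sits halfway between the two means.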

Discriminative vs Generative Models

Discriminative Models: Directly learn $P(c \mid \mathbf{x})$

  • Examples: Logistic regression, SVMs
  • Advantages: Focus on the decision boundary, often better performance
  • Disadvantages: Cannot generate samples, cannot handle partial observations

Generative Models: Learn the joint distribution $P(\mathbf{x}, c)$, then obtain $P(c \mid \mathbf{x})$ via Bayes' theorem

  • Examples: Naive Bayes, Hidden Markov Models
  • Advantages: Can generate samples, handle missing data, incorporate prior knowledge
  • Disadvantages: Require more assumptions, may have bias

Naive Bayes is a typical generative model.

The following figure visualizes Bayes' theorem and conditional independence:

Bayes Theorem Visualization

Naive Bayes Classifier

Conditional Independence Assumption

For high-dimensional features, the number of parameters of the class-conditional probability $P(\mathbf{x} \mid c)$ grows exponentially with dimension (e.g., $O(2^d)$ for $d$ binary features), making estimation difficult.

Naive Bayes Assumption: Given the class $c$, all features are mutually independent:

$$P(\mathbf{x} \mid c) = \prod_{j=1}^{d} P(x^{(j)} \mid c)$$

Intuitive explanation: The class is the "cause," the features are "effects." Once the cause is determined, each "effect" is assumed to occur independently.

Naive Bayes Classification Rule

Combining Bayes' theorem and the conditional independence assumption:

$$P(c \mid \mathbf{x}) \propto P(c) \prod_{j=1}^{d} P(x^{(j)} \mid c)$$

Classifier:

$$h(\mathbf{x}) = \arg\max_{c} P(c) \prod_{j=1}^{d} P(x^{(j)} \mid c)$$

To avoid numerical underflow, use log probabilities in practice:

$$h(\mathbf{x}) = \arg\max_{c} \left[ \log P(c) + \sum_{j=1}^{d} \log P(x^{(j)} \mid c) \right]$$
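A minimal numpy sketch of why the log-space form is needed; the per-feature likelihood values are made up for illustration:

```python
import numpy as np

# 2000 per-feature likelihoods around 1e-3, as might arise for a long
# document: their raw product underflows float64 to exactly 0.0, while
# the log-space sum stays perfectly representable.
probs = np.full(2000, 1e-3)

raw_product = np.prod(probs)       # underflows below ~1e-308 and becomes 0.0
log_score = np.sum(np.log(probs))  # 2000 * log(1e-3) ≈ -13815.5

print(raw_product)  # 0.0
print(log_score)
```

Once every class score is 0.0, the arg max is meaningless; summing logs preserves the ranking exactly.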

Parameter Estimation: Maximum Likelihood

Given a training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, we need to estimate:

  • Prior probability: $P(c)$
  • Class-conditional probability: $P(x^{(j)} \mid c)$

Prior probability estimate:

$$\hat{P}(c) = \frac{N_c}{N}$$

where $N_c$ is the number of samples in class $c$.

Class-conditional probability estimates depend on feature type:

Discrete Features

For a discrete feature $x^{(j)}$ with value set $\{v_1, \ldots, v_{S_j}\}$:

$$\hat{P}(x^{(j)} = v \mid c) = \frac{N_{c,j,v}}{N_c}$$

i.e., the proportion of samples in class $c$ where feature $j$ takes value $v$.

Continuous Features

Assume the class-conditional probability follows a Gaussian distribution:

$$P(x^{(j)} \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,j}} \exp\left( -\frac{(x^{(j)} - \mu_{c,j})^2}{2\sigma_{c,j}^2} \right)$$

Parameter estimates:

$$\hat{\mu}_{c,j} = \frac{1}{N_c} \sum_{i:\, y_i = c} x_i^{(j)}, \qquad \hat{\sigma}_{c,j}^2 = \frac{1}{N_c} \sum_{i:\, y_i = c} \left( x_i^{(j)} - \hat{\mu}_{c,j} \right)^2$$

Laplace Smoothing

Zero probability problem: If some class $c$ never co-occurs with feature value $v$ in the training set, then $\hat{P}(x^{(j)} = v \mid c) = 0$, making the entire posterior probability zero and preventing classification.

Laplace Smoothing: Add a pseudocount $\alpha$ (usually $\alpha = 1$) to each count:

$$\hat{P}(x^{(j)} = v \mid c) = \frac{N_{c,j,v} + \alpha}{N_c + \alpha S_j}$$

where $S_j$ is the number of values for feature $j$.

Prior probability smoothing:

$$\hat{P}(c) = \frac{N_c + \alpha}{N + \alpha K}$$

where $K$ is the number of classes.

Bayesian interpretation: Laplace smoothing is equivalent to the Bayesian posterior-mean estimate under a symmetric Dirichlet prior $\text{Dir}(\alpha, \ldots, \alpha)$ on the multinomial distribution.
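The smoothing formula can be sketched in a few lines of numpy; the toy counts below are assumed purely for illustration:

```python
import numpy as np

def laplace_smooth(counts, alpha=1.0):
    """Add-alpha smoothed probability estimates for one discrete feature."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

counts = np.array([5, 3, 0, 0])    # two feature values never observed
mle = counts / counts.sum()        # MLE gives the unseen values probability 0
smoothed = laplace_smooth(counts)  # smoothing keeps every value possible

print(mle)       # unseen values get exactly 0 mass
print(smoothed)  # 1/2, 1/3, 1/12, 1/12 — still sums to 1
```

Note that the smoothed vector remains a valid distribution: the pseudocounts shift mass toward uniform without breaking normalization.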

The effect of Laplace smoothing is demonstrated below:

Laplace Smoothing

The following figure compares Naive Bayes (diagonal covariance) with Full Bayes (correlated features):

Naive vs Full Bayes

Three Types of Naive Bayes Models

Multinomial Naive Bayes

Suitable for discrete count features, like word frequencies in text data.

Model: Assume the feature vector $\mathbf{x}$ is a word frequency vector, where $x^{(j)}$ is the count of word $j$. The class-conditional probability is multinomial:

$$P(\mathbf{x} \mid c) \propto \prod_{j=1}^{V} \theta_{c,j}^{\,x^{(j)}}$$

where $\theta_{c,j} = P(\text{word } j \mid c)$ and $\sum_{j=1}^{V} \theta_{c,j} = 1$.

Parameter estimation:

$$\hat{\theta}_{c,j} = \frac{\sum_{i:\, y_i = c} x_i^{(j)}}{\sum_{j'=1}^{V} \sum_{i:\, y_i = c} x_i^{(j')}}$$

i.e., word $j$'s frequency as a proportion of the total word frequency in class $c$.
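A small numeric sketch of this estimate; the toy count matrix is an assumption chosen only for illustration:

```python
import numpy as np

# Word-count matrix for the documents of ONE class: 3 documents, 4 words.
X_c = np.array([[2, 1, 0, 0],
                [3, 0, 1, 0],
                [1, 1, 0, 0]])

word_counts = X_c.sum(axis=0)                # per-word frequency in the class
theta_mle = word_counts / word_counts.sum()  # proportion of total frequency

print(theta_mle)  # word 4 never appears, so its MLE is exactly 0
```

The zero estimate for the unseen word is exactly the problem Laplace smoothing fixes in the previous section.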

The following figure illustrates Gaussian Naive Bayes with class-conditional densities:

Gaussian Naive Bayes

The figure below shows text classification with Multinomial Naive Bayes:

Text Classification

Bernoulli Naive Bayes

Suitable for binary features, like presence/absence of words in documents.

Model: $x^{(j)} \in \{0, 1\}$, and the class-conditional probability is Bernoulli:

$$P(\mathbf{x} \mid c) = \prod_{j=1}^{d} \theta_{c,j}^{\,x^{(j)}} (1 - \theta_{c,j})^{1 - x^{(j)}}$$

where $\theta_{c,j} = P(x^{(j)} = 1 \mid c)$.

Difference from Multinomial NB:

  • Multinomial NB considers word frequency (how many times a word appears)
  • Bernoulli NB only considers word presence (whether it appears at all)

For short texts, Bernoulli NB may be better; for long texts, Multinomial NB is usually better.

Gaussian Naive Bayes

Suitable for continuous features, assuming class-conditional probability is Gaussian (see above).

Applications: Sensor data, biomedical features, etc.

Complete Implementation

```python
import numpy as np
from collections import defaultdict


class NaiveBayes:
    def __init__(self, alpha=1.0):
        """
        Parameters:
            alpha: Laplace smoothing parameter
        """
        self.alpha = alpha
        self.classes = None
        self.class_prior = {}
        self.feature_prob = defaultdict(lambda: defaultdict(dict))

    def fit(self, X, y):
        """
        Train Naive Bayes classifier

        Parameters:
            X: ndarray, shape (N, d), discrete features
            y: ndarray, shape (N,), class labels
        """
        N, d = X.shape
        self.classes = np.unique(y)
        K = len(self.classes)

        # Compute prior probabilities (Laplace smoothed)
        for c in self.classes:
            N_c = np.sum(y == c)
            self.class_prior[c] = (N_c + self.alpha) / (N + K * self.alpha)

        # Compute class-conditional probabilities
        for j in range(d):
            values = np.unique(X[:, j])
            S_j = len(values)

            for c in self.classes:
                X_c = X[y == c]
                N_c = len(X_c)

                for val in values:
                    count = np.sum(X_c[:, j] == val)
                    # Laplace smoothing
                    self.feature_prob[j][c][val] = (count + self.alpha) / (N_c + S_j * self.alpha)

        return self

    def predict_log_proba(self, X):
        """Compute log posterior probabilities"""
        N = X.shape[0]
        K = len(self.classes)
        log_proba = np.zeros((N, K))

        for idx, c in enumerate(self.classes):
            # Log prior
            log_proba[:, idx] = np.log(self.class_prior[c])

            # Log class-conditional; feature values unseen in training are skipped
            for j in range(X.shape[1]):
                for i in range(N):
                    val = X[i, j]
                    if val in self.feature_prob[j][c]:
                        log_proba[i, idx] += np.log(self.feature_prob[j][c][val])

        return log_proba

    def predict(self, X):
        """Predict classes"""
        log_proba = self.predict_log_proba(X)
        return self.classes[np.argmax(log_proba, axis=1)]

    def score(self, X, y):
        """Compute accuracy"""
        return np.mean(self.predict(X) == y)


class GaussianNB:
    """Gaussian Naive Bayes (continuous features)"""

    def __init__(self):
        self.classes = None
        self.class_prior = {}
        self.theta = {}  # Means
        self.sigma = {}  # Standard deviations

    def fit(self, X, y):
        N, d = X.shape
        self.classes = np.unique(y)

        for c in self.classes:
            X_c = X[y == c]
            self.class_prior[c] = len(X_c) / N
            self.theta[c] = np.mean(X_c, axis=0)
            self.sigma[c] = np.std(X_c, axis=0) + 1e-9  # Avoid division by zero

        return self

    def _gaussian_prob(self, x, mean, std):
        """Gaussian probability density"""
        exponent = -((x - mean) ** 2) / (2 * std ** 2)
        return np.exp(exponent) / (np.sqrt(2 * np.pi) * std)

    def predict_log_proba(self, X):
        N = X.shape[0]
        K = len(self.classes)
        log_proba = np.zeros((N, K))

        for idx, c in enumerate(self.classes):
            log_proba[:, idx] = np.log(self.class_prior[c])
            for j in range(X.shape[1]):
                prob = self._gaussian_prob(X[:, j], self.theta[c][j], self.sigma[c][j])
                log_proba[:, idx] += np.log(prob + 1e-9)

        return log_proba

    def predict(self, X):
        log_proba = self.predict_log_proba(X)
        return self.classes[np.argmax(log_proba, axis=1)]

    def score(self, X, y):
        return np.mean(self.predict(X) == y)


class MultinomialNB:
    """Multinomial Naive Bayes (word frequency features)"""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.classes = None
        self.class_prior = {}
        self.feature_log_prob = {}

    def fit(self, X, y):
        """
        Parameters:
            X: ndarray, shape (N, d), word frequency matrix
            y: ndarray, shape (N,), class labels
        """
        N, d = X.shape
        self.classes = np.unique(y)

        for c in self.classes:
            X_c = X[y == c]
            N_c = len(X_c)

            # Prior probability
            self.class_prior[c] = N_c / N

            # Class-conditional probability (word probability distribution)
            word_counts = np.sum(X_c, axis=0)  # Total frequency of each word
            total_count = np.sum(word_counts)  # Total words

            # Laplace smoothing
            self.feature_log_prob[c] = np.log((word_counts + self.alpha) / (total_count + d * self.alpha))

        return self

    def predict_log_proba(self, X):
        N = X.shape[0]
        K = len(self.classes)
        log_proba = np.zeros((N, K))

        for idx, c in enumerate(self.classes):
            # Log prior
            log_proba[:, idx] = np.log(self.class_prior[c])
            # Log likelihood: sum_j x^(j) * log(theta_j)
            log_proba[:, idx] += X @ self.feature_log_prob[c]

        return log_proba

    def predict(self, X):
        log_proba = self.predict_log_proba(X)
        return self.classes[np.argmax(log_proba, axis=1)]

    def score(self, X, y):
        return np.mean(self.predict(X) == y)
```

Q&A Highlights

Q1: Why called "naive" Bayes?

A: "Naive" refers to the conditional independence assumption — assuming all features are mutually independent given the class. This assumption often doesn't hold in reality (e.g., "machine" and "learning" are highly correlated in documents), but greatly simplifies computation, making the model simple and efficient.


Q2: Relationship between Naive Bayes and logistic regression?

A: Under binary classification and the Gaussian assumption (with variances shared across classes), Naive Bayes is equivalent to a special case of logistic regression:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left( -(\mathbf{w}^\top \mathbf{x} + b) \right)}$$

where the weights $\mathbf{w}$ and bias $b$ are analytically determined by the Naive Bayes parameters. Difference: Naive Bayes is generative (models the joint distribution), logistic regression is discriminative (models the conditional distribution).
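A sketch of the derivation, assuming each feature is Gaussian with class-shared variance $\sigma_j^2$ (notation follows the Gaussian NB section above):

```latex
\log \frac{P(y=1 \mid \mathbf{x})}{P(y=0 \mid \mathbf{x})}
  = \log \frac{P(y=1)}{P(y=0)}
  + \sum_{j=1}^{d} \log
      \frac{\mathcal{N}\!\left(x^{(j)};\, \mu_{1,j},\, \sigma_j^2\right)}
           {\mathcal{N}\!\left(x^{(j)};\, \mu_{0,j},\, \sigma_j^2\right)}
  = b + \sum_{j=1}^{d} \frac{\mu_{1,j} - \mu_{0,j}}{\sigma_j^2}\, x^{(j)}
```

The quadratic terms in $x^{(j)}$ cancel because the variance is shared between classes, leaving log-odds linear in $\mathbf{x}$; here $b$ collects the prior log-odds and the constants $\sum_j (\mu_{0,j}^2 - \mu_{1,j}^2)/(2\sigma_j^2)$. Applying the sigmoid to these log-odds yields exactly the logistic regression form.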


Q3: What is the essence of Laplace smoothing?

A: From the Bayesian view, add-$\alpha$ smoothing is the posterior mean under a symmetric Dirichlet prior $\text{Dir}(\alpha, \ldots, \alpha)$ on the multinomial parameters. $\alpha = 1$ corresponds to a uniform prior (Laplace smoothing); larger $\alpha$ favors smoother (closer to uniform) distributions; $\alpha < 1$ favors sparser distributions.


Q4: How to handle non-Gaussian continuous features?

A:

  1. Kernel density estimation: Non-parametric estimation of $P(x^{(j)} \mid c)$
  2. Discretization: Bin continuous features into discrete features
  3. Transformation: Log transform or Box-Cox transform to make the distribution closer to Gaussian

Q5: Can Naive Bayes output probabilities?

A: Yes, but the probability values are often inaccurate (too extreme). Reason: when the conditional independence assumption is violated, correlated evidence is counted multiple times in the log-probability sum, pushing posteriors toward 0 or 1. For accurate probabilities, consider probability calibration (Platt Scaling, Isotonic Regression).


Q6: How to choose between Multinomial and Bernoulli NB?

A:

  • Long documents: Multinomial NB (rich word frequency information)
  • Short documents: Bernoulli NB (presence more important than frequency)
  • Sparse data: Bernoulli NB (models word absence explicitly)

In practice, choose via cross-validation.


Q7: Can Naive Bayes handle missing values?

A: Yes, naturally: during prediction, ignore missing features and accumulate log-probabilities only over the observed ones:

$$\text{score}(c) = \log P(c) + \sum_{j \in \text{observed}} \log P(x^{(j)} \mid c)$$
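A minimal sketch of this idea; the log-probability values are made-up placeholders, and `np.nan` marks a feature missing at prediction time:

```python
import numpy as np

log_prior = np.log(0.4)  # log P(c), hypothetical value
log_lik = np.array([np.log(0.2), np.nan, np.log(0.5)])  # feature 2 missing

observed = ~np.isnan(log_lik)
score = log_prior + log_lik[observed].sum()  # sum only over observed features

print(score)  # the missing feature simply contributes nothing
```

Because the score is a sum of independent per-feature terms, dropping a term is equivalent to marginalizing that feature out — no imputation step is needed.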

Q8: Time complexity of Naive Bayes?

A:

  • Training: $O(Nd)$ — one pass over the data to collect count statistics
  • Prediction: $O(Kd)$ per sample ($K$ classes, $d$ features)

Much faster than SVM, neural networks; suitable for large-scale text classification.


Q10: Summary of Naive Bayes pros and cons?

A: Pros:

  • Simple, efficient, easy to implement
  • Good performance with small samples
  • Naturally handles multi-class problems
  • High interpretability
  • Supports online learning

Cons:

  • Conditional independence assumption often violated
  • Sensitive to feature correlations
  • Inaccurate probability estimates
  • Cannot learn feature interactions


✏️ Exercises and Solutions

Exercise 1: Conditional Independence Property

Problem: Suppose we are given $P(Y \mid X_1)$ together with several other probabilities. If $X_2$ is assumed conditionally independent of $Y$ given $X_1$ (i.e., $P(X_2 \mid X_1, Y) = P(X_2 \mid X_1)$), calculate $P(Y \mid X_1, X_2)$.

Solution:

Under conditional independence:

$$P(Y \mid X_1, X_2) = \frac{P(X_2 \mid X_1, Y)\, P(Y \mid X_1)}{P(X_2 \mid X_1)} = \frac{P(X_2 \mid X_1)\, P(Y \mid X_1)}{P(X_2 \mid X_1)} = P(Y \mid X_1)$$

So the answer is simply the given $P(Y \mid X_1)$: conditional independence means that once $X_1$ is known, $X_2$ is irrelevant for predicting $Y$.


Exercise 2: Laplace Smoothing Effect

Problem: In text classification with 2 classes and vocabulary size $V$, the word "excellent" appears 5 times in the positive class, whose total word count is 200. Calculate: (1) the MLE estimate without smoothing; (2) the Laplace smoothed estimate ($\alpha = 1$).

Solution:

(1) MLE without smoothing:

$$\hat{\theta}_{\text{excellent}} = \frac{5}{200} = 0.025$$

(2) Laplace smoothing ($\alpha = 1$):

$$\hat{\theta}_{\text{excellent}} = \frac{5 + 1}{200 + V}$$

Effect: For a realistically large vocabulary, smoothing reduces the probability estimate (it redistributes mass to unseen words). For unseen words ($\text{count} = 0$):

$$\hat{\theta}_{\text{unseen}} = \frac{0 + 1}{200 + V} > 0$$

This prevents the zero probability problem.
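The arithmetic can be checked numerically; since the vocabulary size is not specified above, `V = 10_000` below is a purely hypothetical choice for illustration:

```python
# Numerical check of the smoothing arithmetic; V = 10_000 is a
# hypothetical vocabulary size, not taken from the exercise.
count, total, V, alpha = 5, 200, 10_000, 1.0

mle = count / total                               # 0.025
smoothed = (count + alpha) / (total + alpha * V)  # mass shifted to unseen words
unseen = alpha / (total + alpha * V)              # nonzero even for count = 0

print(mle, smoothed, unseen)
```

With a large vocabulary the smoothed estimate drops well below the MLE, exactly the redistribution effect described above.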


Exercise 3: Multinomial vs Bernoulli

Problem: An email has 100 words: "win" appears 5 times and "free" appears 3 times. Multinomial NB uses word probabilities $\theta_{\text{win}}$ and $\theta_{\text{free}}$ for the spam class; Bernoulli NB uses presence probabilities $\theta'_{\text{win}}$ and $\theta'_{\text{free}}$. Calculate the contribution of these two words to the spam score under each model.

Solution:

Multinomial NB (word frequencies matter):

$$5 \log \theta_{\text{win}} + 3 \log \theta_{\text{free}}$$

Bernoulli NB (only presence/absence):

$$\log \theta'_{\text{win}} + \log \theta'_{\text{free}}$$

Key difference:

  • Multinomial: Repetition (5× "win") amplifies the signal
  • Bernoulli: Only cares that both words are present, ignores frequency

For spam detection with highly repetitive keywords, Multinomial NB is more suitable.


Exercise 4: Gaussian NB Parameter Estimation

Problem: 2D dataset with 6 samples: - Class +1: - Class -1:

Calculate Gaussian NB parameters for class +1.

Solution:

Prior:

$$\hat{P}(+1) = \frac{N_{+1}}{6}$$

Feature 1 parameters (class +1): the per-class sample mean and variance,

$$\hat{\mu}_{+1,1} = \frac{1}{N_{+1}} \sum_{i:\, y_i = +1} x_i^{(1)}, \qquad \hat{\sigma}_{+1,1}^2 = \frac{1}{N_{+1}} \sum_{i:\, y_i = +1} \left( x_i^{(1)} - \hat{\mu}_{+1,1} \right)^2$$

Feature 2 parameters (class +1): the same formulas applied to the second coordinate $x^{(2)}$.


Exercise 5: Why Naive Bayes Works Despite Violated Assumptions

Problem: Explain why Naive Bayes often performs well in practice even when the conditional independence assumption is violated.

Solution:

Key insights:

  1. Classification focuses on ranking, not accurate probabilities:
    • NB only needs $\arg\max_c P(c \mid \mathbf{x})$ to be correct, not the probability values to be accurate
    • Even if the estimates are biased, the class ranking may remain correct
  2. Error cancellation:
    • If all classes have similar correlation patterns, the bias affects all classes similarly
    • Relative posterior probabilities may stay correct
  3. Occam's Razor effect:
    • The simpler model (independence assumption) has lower variance
    • Bias-variance tradeoff: increased bias may be offset by reduced variance
  4. Calibration isn't classification:
    • Domingos & Pazzani (1997) proved: NB is optimal under 0-1 loss as long as it assigns the highest posterior to the true class (even if the probability values are wrong)
    • For many tasks, NB's probability ordering is correct

Empirical evidence: Text classification (highly correlated features) shows NB competitive with complex models, validating these theoretical insights.

References

  1. Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3), 103-130.
  2. McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. AAAI Workshop on Learning for Text Categorization.
  3. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. [Chapter 13: Text Classification]
  4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. [Section 8.2: Naive Bayes]
  5. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Chapter 3: Generative Models for Discrete Data]

Naive Bayes, with its minimalist form and surprisingly good performance, is a classic algorithm for machine learning beginners. From spam filtering to sentiment analysis, from medical diagnosis to recommendation systems, Naive Bayes has proven the wisdom of "simplicity is the ultimate sophistication" in countless applications. Understanding Naive Bayes is not only the starting point for probabilistic classification, but also the foundation for Bayesian networks, Hidden Markov Models, and other advanced probabilistic graphical models.

  • Create time: 2021-10-12 09:15:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-9-Naive-Bayes/