Naive Bayes is among the simplest yet most elegant probabilistic classifiers — based on Bayes' theorem and the conditional independence assumption, it decomposes complex joint probabilities into simple products of conditional probabilities, enabling efficient classification learning. Despite the "naive" assumption often not holding in reality, Naive Bayes shows remarkable effectiveness in text classification, spam filtering, and sentiment analysis. This chapter systematically derives the theoretical foundations, parameter estimation methods, smoothing techniques, and performance analysis of Naive Bayes.
Bayesian Decision Theory Foundations
Bayes' Theorem and Posterior Probability
Bayes' Theorem is the core formula of probability theory, describing how to update prior beliefs through observed data:

$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\,P(y)}{P(\mathbf{x})}$$

where:
- $P(y \mid \mathbf{x})$ is the posterior probability of class $y$ given features $\mathbf{x}$
- $P(\mathbf{x} \mid y)$ is the likelihood (class-conditional probability)
- $P(y)$ is the prior probability of class $y$
- $P(\mathbf{x}) = \sum_{y} P(\mathbf{x} \mid y)\,P(y)$ is the evidence
Bayes Optimal Classifier
For classification tasks, Bayesian decision theory gives the optimal decision rule: select the class with maximum posterior probability:

$$\hat{y} = \arg\max_{y} P(y \mid \mathbf{x}) = \arg\max_{y} \frac{P(\mathbf{x} \mid y)\,P(y)}{P(\mathbf{x})}$$

Since the denominator $P(\mathbf{x})$ is the same for every class, it can be dropped:

$$\hat{y} = \arg\max_{y} P(\mathbf{x} \mid y)\,P(y)$$
This is the Maximum A Posteriori (MAP) criterion.
Theoretical guarantee: the Bayes optimal classifier minimizes the expected classification error:

$$R^{*} = \mathbb{E}_{\mathbf{x}}\!\left[1 - \max_{y} P(y \mid \mathbf{x})\right]$$

This is the minimum error any classifier can achieve (the Bayes error).
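As a small numerical illustration of the MAP rule, the sketch below uses made-up priors and likelihoods for a two-class problem (all numbers are assumptions for illustration, not from the text):

```python
import numpy as np

# Hypothetical two-class problem: priors P(y) and likelihoods P(x|y)
# for one observed feature value x (numbers are illustrative only).
priors = np.array([0.7, 0.3])        # P(y=0), P(y=1)
likelihoods = np.array([0.2, 0.9])   # P(x|y=0), P(x|y=1)

# Unnormalized posteriors P(x|y) * P(y); the evidence P(x) is a
# shared constant, so the argmax is unaffected by normalization.
scores = likelihoods * priors
y_hat = int(np.argmax(scores))

posterior = scores / scores.sum()    # normalized P(y|x)
print(y_hat, posterior.round(3))
```

Note that although class 0 has the larger prior, the likelihood term overrules it — exactly the prior-updating behavior Bayes' theorem describes.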
Discriminative vs Generative Models
Discriminative Models: directly learn the posterior $P(y \mid \mathbf{x})$ or the decision boundary (e.g., logistic regression).

Generative Models: learn the joint distribution $P(\mathbf{x}, y) = P(\mathbf{x} \mid y)\,P(y)$, then obtain the posterior via Bayes' theorem.
Naive Bayes is a typical generative model.
The following figure visualizes Bayes' theorem and conditional independence:
Naive Bayes Classifier
Conditional Independence Assumption
For high-dimensional features $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, directly estimating the full class-conditional distribution $P(\mathbf{x} \mid y)$ requires a number of parameters exponential in $d$.

Naive Bayes Assumption: given the class $y$, the features are conditionally independent:

$$P(\mathbf{x} \mid y) = \prod_{j=1}^{d} P(x_j \mid y)$$
Intuitive explanation: The class is the "cause," features are "effects." Assuming each "effect" occurs independently once the cause is determined.
Naive Bayes Classification Rule
Combining Bayes' theorem and the conditional independence assumption:

$$P(y \mid \mathbf{x}) \propto P(y) \prod_{j=1}^{d} P(x_j \mid y)$$

Classifier:

$$\hat{y} = \arg\max_{y}\; P(y) \prod_{j=1}^{d} P(x_j \mid y)$$

To avoid numerical underflow, use log probabilities in practice:

$$\hat{y} = \arg\max_{y}\left[\log P(y) + \sum_{j=1}^{d} \log P(x_j \mid y)\right]$$

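The log-space trick can be sketched in a few lines (the conditional probabilities below are made-up illustrative values):

```python
import numpy as np

# Sketch of the log-space decision rule: instead of multiplying many
# small probabilities (which can underflow to 0.0), sum their logs.
log_prior = np.log(np.array([0.5, 0.5]))   # log P(y) per class
cond = np.array([[0.01, 0.20, 0.05],       # P(x_j | y=0), illustrative
                 [0.10, 0.02, 0.30]])      # P(x_j | y=1), illustrative

# log P(y) + sum_j log P(x_j | y), computed per class
log_scores = log_prior + np.log(cond).sum(axis=1)
y_hat = int(np.argmax(log_scores))
print(log_scores, y_hat)
```

With thousands of features the raw product would underflow double precision, while the log-sum stays comfortably finite.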
Parameter Estimation: Maximum Likelihood
Given training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, the parameters to estimate are:

- Prior probability: $P(y = c)$ for each class $c$
- Class-conditional probability: $P(x_j \mid y = c)$ for each feature $j$ and class $c$

Prior probability estimate:

$$\hat{P}(y = c) = \frac{N_c}{N}$$

where $N_c$ is the number of training samples belonging to class $c$.
Class-conditional probability estimates depend on feature type:
Discrete Features
For a discrete feature $x_j$ taking value $v$:

$$\hat{P}(x_j = v \mid y = c) = \frac{N_{c,j,v}}{N_c}$$

i.e., the proportion of class-$c$ samples whose feature $j$ equals $v$, where $N_{c,j,v}$ counts those samples.
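This frequency estimate is a direct count ratio; a minimal sketch on toy categorical data (the data itself is an assumption for illustration):

```python
import numpy as np

# Frequency estimate for one discrete feature:
# P(x = v | y = c) = (# class-c samples with x = v) / N_c
x = np.array(['a', 'b', 'a', 'a', 'b', 'c'])  # illustrative feature values
y = np.array([0, 0, 0, 1, 1, 1])              # illustrative labels

c, v = 0, 'a'
Nc = (y == c).sum()                       # N_c: samples in class c
p = ((x == v) & (y == c)).sum() / Nc      # proportion with x = v
print(p)  # 2 of the 3 class-0 samples have x = 'a'
```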
Continuous Features
Assume the class-conditional probability follows a Gaussian distribution:

$$P(x_j \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{c,j}^{2}}} \exp\!\left(-\frac{(x_j - \mu_{c,j})^{2}}{2\sigma_{c,j}^{2}}\right)$$

Parameter estimates (MLE):

$$\hat{\mu}_{c,j} = \frac{1}{N_c} \sum_{i:\, y^{(i)} = c} x_j^{(i)}, \qquad \hat{\sigma}_{c,j}^{2} = \frac{1}{N_c} \sum_{i:\, y^{(i)} = c} \left(x_j^{(i)} - \hat{\mu}_{c,j}\right)^{2}$$
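These estimates reduce to per-class, per-feature means and variances; a short sketch on made-up 2D data:

```python
import numpy as np

# MLE of Gaussian NB parameters: per-class, per-feature mean/variance.
# X and y are illustrative toy data, not from the text.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

mu, var, prior = {}, {}, {}
for c in np.unique(y):
    Xc = X[y == c]
    prior[c] = len(Xc) / len(X)   # N_c / N
    mu[c] = Xc.mean(axis=0)       # per-feature class mean
    var[c] = Xc.var(axis=0)       # per-feature class variance (MLE)

print(prior, mu, var)
```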
Laplace Smoothing
Zero probability problem: if some feature value $v$ never co-occurs with class $c$ in the training set, then $\hat{P}(x_j = v \mid y = c) = 0$, and the whole product $\prod_j P(x_j \mid y = c)$ collapses to zero no matter how strongly the other features support class $c$.

Laplace Smoothing: add a pseudocount $\alpha > 0$ (typically $\alpha = 1$) to every count:

$$\hat{P}(x_j = v \mid y = c) = \frac{N_{c,j,v} + \alpha}{N_c + \alpha S_j}$$

where $S_j$ is the number of possible values of feature $j$.

Prior probability smoothing:

$$\hat{P}(y = c) = \frac{N_c + \alpha}{N + \alpha K}$$

where $K$ is the number of classes.
Bayesian interpretation: Laplace smoothing is equivalent to Bayesian estimation under a symmetric Dirichlet prior — with $\alpha = 1$ it is exactly the posterior mean under a uniform $\mathrm{Dir}(1, \ldots, 1)$ prior.
The effect of Laplace smoothing is demonstrated below:
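A quick numerical sketch of the same effect, using made-up counts for one class (3 vocabulary words, the third never observed):

```python
import numpy as np

# Effect of Laplace smoothing on class-conditional estimates.
# Illustrative counts: word 3 never appeared in this class.
counts = np.array([8, 2, 0])
V = len(counts)      # number of possible feature values
alpha = 1.0          # pseudocount

mle = counts / counts.sum()                        # unsmoothed: has a zero
smoothed = (counts + alpha) / (counts.sum() + alpha * V)

print(mle, smoothed)
```

The smoothed estimate shaves a little mass off the observed words and guarantees every value a strictly positive probability.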
The following figure compares Naive Bayes (diagonal covariance) with Full Bayes (correlated features):
Three Types of Naive Bayes Models
Multinomial Naive Bayes
Suitable for discrete count features, like word frequencies in text data.
Model: assume the word-count vector follows a multinomial distribution given the class:

$$P(\mathbf{x} \mid y = c) \propto \prod_{j=1}^{V} \theta_{c,j}^{\,x_j}$$

where $x_j$ is the count of word $j$ in the document, $\theta_{c,j} = P(\text{word } j \mid y = c)$, and $V$ is the vocabulary size.

Parameter estimation (with Laplace smoothing):

$$\hat{\theta}_{c,j} = \frac{N_{c,j} + \alpha}{\sum_{j'=1}^{V} N_{c,j'} + \alpha V}$$

i.e., word $j$'s smoothed relative frequency across all documents of class $c$, where $N_{c,j}$ is the total count of word $j$ in those documents.
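The estimator above can be sketched directly on a tiny document-term matrix (the documents and counts are illustrative assumptions):

```python
import numpy as np

# Multinomial NB estimation with Laplace smoothing:
# theta[c, j] = (N_cj + alpha) / (sum_j' N_cj' + alpha * V).
# Rows of X are documents as word-count vectors (illustrative data).
X = np.array([[3, 0, 1],     # doc 1, class 0
              [2, 1, 0],     # doc 2, class 0
              [0, 4, 2]])    # doc 3, class 1
y = np.array([0, 0, 1])
alpha, V = 1.0, X.shape[1]

theta = np.zeros((2, V))
for c in (0, 1):
    word_counts = X[y == c].sum(axis=0)   # N_cj summed over class-c docs
    theta[c] = (word_counts + alpha) / (word_counts.sum() + alpha * V)

print(theta)
```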
The following figure illustrates Gaussian Naive Bayes with class-conditional densities:
The figure below shows text classification with Multinomial Naive Bayes:
Bernoulli Naive Bayes
Suitable for binary features, like presence/absence of words in documents.
Model:

$$P(\mathbf{x} \mid y = c) = \prod_{j=1}^{V} \theta_{c,j}^{\,x_j} \left(1 - \theta_{c,j}\right)^{1 - x_j}$$

where $x_j \in \{0, 1\}$ indicates whether word $j$ appears in the document, and $\theta_{c,j} = P(x_j = 1 \mid y = c)$. Note that absent words also contribute, through the $(1 - \theta_{c,j})$ factor.
Difference from Multinomial NB:
- Multinomial NB considers word frequency (how many times a word occurs)
- Bernoulli NB only considers word presence (whether a word appears at all)
For short texts, Bernoulli NB may be better; for long texts, Multinomial NB is usually better.
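The contrast can be made concrete by scoring one document under both event models (the per-word parameters and counts below are illustrative assumptions):

```python
import numpy as np

# One class's per-word parameters and one document's word counts
# (illustrative values only).
theta = np.array([0.30, 0.10, 0.05])   # P(word j | class) per model
x_counts = np.array([5, 3, 0])          # word frequencies in the document
x_binary = (x_counts > 0).astype(float)

# Multinomial: sum_j x_j * log(theta_j), so repetition amplifies signal
# (the count-only multinomial coefficient is class-independent, dropped)
ll_multinomial = (x_counts * np.log(theta)).sum()

# Bernoulli: each present word contributes log(theta_j) once,
# each absent word contributes log(1 - theta_j)
ll_bernoulli = (x_binary * np.log(theta)
                + (1 - x_binary) * np.log(1 - theta)).sum()

print(ll_multinomial, ll_bernoulli)
```

Repeating a word five times changes the multinomial score five-fold but leaves the Bernoulli score untouched, which is exactly the long-vs-short-document trade-off described above.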
Gaussian Naive Bayes
Suitable for continuous features, assuming class-conditional probability is Gaussian (see above).
Applications: Sensor data, biomedical features, etc.
Complete Implementation
```python
import numpy as np
```
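The full implementation here appears truncated; a minimal from-scratch Gaussian Naive Bayes in the same spirit might look like the following sketch (the class layout, names, and toy data are my own, not necessarily the author's original code):

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes with log-space prediction."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        # small variance floor avoids division by zero
        self.var = np.array([X[y == c].var(axis=0) + 1e-9
                             for c in self.classes])
        return self

    def predict(self, X):
        # score = log P(y) + sum_j log N(x_j; mu_cj, var_cj), per class
        log_prior = np.log(self.prior)                  # (K,)
        diff = X[:, None, :] - self.mu[None, :, :]      # (n, K, d)
        log_lik = -0.5 * (np.log(2 * np.pi * self.var)
                          + diff ** 2 / self.var).sum(axis=2)
        return self.classes[np.argmax(log_prior + log_lik, axis=1)]

# Toy usage on two well-separated Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = GaussianNB().fit(X, y)
acc = (model.predict(X) == y).mean()
print(acc)
```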
Q&A Highlights
Q1: Why called "naive" Bayes?
A: "Naive" refers to the conditional independence assumption — assuming all features are mutually independent given the class. This assumption often doesn't hold in reality (e.g., "machine" and "learning" are highly correlated in documents), but greatly simplifies computation, making the model simple and efficient.
Q2: Relationship between Naive Bayes and logistic regression?
A: Under binary classification with Gaussian class-conditional densities sharing a per-feature variance across classes, Naive Bayes yields a posterior of exactly logistic form:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\left(-(\mathbf{w}^{\top}\mathbf{x} + b)\right)}$$

where the weights $\mathbf{w}$ and bias $b$ are determined by the Gaussian parameters and class priors. The difference is in training: logistic regression fits $\mathbf{w}, b$ discriminatively, while Naive Bayes estimates them generatively.
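This equivalence can be checked numerically; the sketch below assumes two Gaussian features with per-feature variances shared across the two classes (all parameter values are illustrative):

```python
import numpy as np

# NB <-> logistic-regression link: for two classes with Gaussian
# features sharing a per-feature variance, the NB posterior equals
# a sigmoid of a linear function of x.
mu0, mu1 = np.array([0.0, 1.0]), np.array([2.0, -1.0])
sigma2 = np.array([1.0, 2.0])   # per-feature variance, shared by classes
p1 = 0.4                        # prior P(y=1)

def nb_posterior(x):
    def lik(mu):
        return (np.exp(-(x - mu) ** 2 / (2 * sigma2))
                / np.sqrt(2 * np.pi * sigma2))
    s0 = (1 - p1) * lik(mu0).prod()
    s1 = p1 * lik(mu1).prod()
    return s1 / (s0 + s1)

# Closed-form logistic weights implied by the Gaussian parameters
w = (mu1 - mu0) / sigma2
b = ((mu0 ** 2 - mu1 ** 2) / (2 * sigma2)).sum() + np.log(p1 / (1 - p1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.7, 0.2])
print(nb_posterior(x), sigmoid(w @ x + b))  # the two values agree
```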
Q3: What is the essence of Laplace smoothing?
A: From the Bayesian view, Laplace smoothing with $\alpha = 1$ is the posterior mean of the multinomial parameters under a uniform Dirichlet prior $\mathrm{Dir}(1, \ldots, 1)$ (equivalently, the MAP estimate under $\mathrm{Dir}(2, \ldots, 2)$). The pseudocounts act as "virtual samples" encoding the prior belief that every value is possible.
Q4: How to handle non-Gaussian continuous features?
A:
1. Kernel density estimation: estimate $P(x_j \mid y)$ non-parametrically from the data
2. Discretization: bin continuous features into discrete features
3. Transformation: apply a log or Box-Cox transform to make the distribution closer to Gaussian
Q5: Can Naive Bayes output probabilities?
A: Yes, but probability values are often inaccurate (too extreme). Reason: Conditional independence assumption causes log-probability accumulation, amplifying bias. For accurate probabilities, consider probability calibration (Platt Scaling, Isotonic Regression).
Q6: How to choose between Multinomial and Bernoulli NB?
A:
- Long documents: Multinomial NB (rich word frequency information)
- Short documents: Bernoulli NB (presence more important than frequency)
- Sparse data: Bernoulli NB (avoids zero probabilities)
In practice, choose via cross-validation.
Q7: Can Naive Bayes handle missing values?
A: Yes, naturally: during prediction, ignore the missing features and sum log-probabilities over the observed features only:

$$\hat{y} = \arg\max_{y}\left[\log P(y) + \sum_{j \in \text{observed}} \log P(x_j \mid y)\right]$$

This works because, under conditional independence, each feature's contribution can be dropped without affecting the others.
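A minimal sketch of this, masking out a NaN-marked feature at prediction time (the binary-feature parameters are illustrative assumptions):

```python
import numpy as np

# Prediction with a missing feature: sum log-probabilities only over
# observed (non-NaN) entries. Parameters are illustrative; rows give
# log P(x_j = 1 | y) for each class, for binary features.
log_prior = np.log(np.array([0.5, 0.5]))
log_cond = np.log(np.array([[0.6, 0.1, 0.3],    # class 0
                            [0.2, 0.7, 0.4]]))  # class 1

x = np.array([1.0, np.nan, 1.0])   # feature 2 is missing
observed = ~np.isnan(x)            # boolean mask of usable features

# All observed values here are 1, so the matching terms are log P(x_j=1|y)
scores = log_prior + log_cond[:, observed].sum(axis=1)
y_hat = int(np.argmax(scores))
print(y_hat)
```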
Q8: Time complexity of Naive Bayes?

A:
- Training: $O(Nd)$ — a single pass over $N$ samples with $d$ features to accumulate counts (or means and variances)
- Prediction: $O(Kd)$ per sample, for $K$ classes
Much faster than SVM, neural networks; suitable for large-scale text classification.
Q9: Summary of Naive Bayes pros and cons?
A: Pros:
- Simple, efficient, easy to implement
- Good performance with small samples
- Naturally handles multi-class problems
- High interpretability
- Supports online learning

Cons:
- Conditional independence assumption often violated
- Sensitive to feature correlations
- Inaccurate probability estimates
- Cannot learn feature interactions
✏️ Exercises and Solutions
Exercise 1: Conditional Independence Property
Problem: Given that features $x_1$ and $x_2$ are conditionally independent given $y$, are they also marginally independent?

Solution:

Under conditional independence:

$$P(x_1, x_2 \mid y) = P(x_1 \mid y)\,P(x_2 \mid y)$$

Use the law of total probability:

$$P(x_1, x_2) = \sum_{y} P(x_1 \mid y)\,P(x_2 \mid y)\,P(y)$$

But this sum does not in general factor into $P(x_1)\,P(x_2)$, so conditional independence does not imply marginal independence.

(The conditional independence means knowing $x_1$ adds nothing about $x_2$ once $y$ is known; marginally, both features still carry information about $y$, and hence about each other.)
Exercise 2: Laplace Smoothing Effect
Problem: In text classification with 2 classes and vocabulary size $V$, suppose word $w$ appears $N_{c,w}$ times in class $c$, whose documents contain $N_c$ words in total. Compare the MLE and Laplace-smoothed estimates of $P(w \mid c)$.

Solution:

(1) MLE without smoothing:

$$\hat{P}(w \mid c) = \frac{N_{c,w}}{N_c}$$

(2) Laplace smoothing ($\alpha = 1$):

$$\hat{P}(w \mid c) = \frac{N_{c,w} + 1}{N_c + V}$$

Effect: smoothing slightly reduces the estimates for observed words (redistributing mass to unseen words). For unseen words ($N_{c,w} = 0$):

$$\hat{P}(w \mid c) = \frac{1}{N_c + V} > 0$$

This prevents the zero probability problem.
Exercise 3: Multinomial vs Bernoulli
Problem: An email has 100 words: "win" appears 5 times and "free" appears 3 times. How do Multinomial NB and Bernoulli NB score these two words differently?

Solution:

Multinomial NB (word frequencies matter): the two words contribute

$$5 \log \theta_{c,\text{win}} + 3 \log \theta_{c,\text{free}}$$

Bernoulli NB (only presence/absence): they contribute

$$\log \theta_{c,\text{win}} + \log \theta_{c,\text{free}}$$

Key difference:
- Multinomial: repetition (5 occurrences of "win") amplifies the signal
- Bernoulli: only cares that both words are present, ignoring frequency
For spam detection with highly repetitive keywords, Multinomial NB is more suitable.
Exercise 4: Gaussian NB Parameter Estimation
Problem: Given a 2D dataset with 6 samples split between classes +1 and −1, calculate the Gaussian NB parameters for class +1.

Solution:

Prior:

$$\hat{P}(y = +1) = \frac{N_{+1}}{6}$$

Feature 1 parameters (class +1):

$$\hat{\mu}_{+1,1} = \frac{1}{N_{+1}} \sum_{i:\, y^{(i)} = +1} x_1^{(i)}, \qquad \hat{\sigma}_{+1,1}^{2} = \frac{1}{N_{+1}} \sum_{i:\, y^{(i)} = +1} \left(x_1^{(i)} - \hat{\mu}_{+1,1}\right)^{2}$$

Feature 2 parameters (class +1): computed the same way from the second coordinate.
Exercise 5: Why Naive Bayes Works Despite Violated Assumptions
Problem: Explain why Naive Bayes often performs well in practice even when the conditional independence assumption is violated.
Solution:
Key insights:
- Classification depends on ranking, not accurate probabilities:
  - NB only needs the correct $\arg\max_y P(y \mid \mathbf{x})$, not accurate posterior values
  - Even if the probability estimates are biased, the class ranking may remain correct
- Error cancellation:
  - If all classes have similar correlation patterns, the bias affects all classes similarly
  - Relative posterior probabilities may stay correct
- Occam's Razor effect:
  - The simpler model (independence assumption) has lower variance
  - Bias-variance tradeoff: the increased bias may be offset by the reduced variance
- Calibration isn't classification:
  - Domingos & Pazzani (1997) proved NB is optimal under 0-1 loss whenever it assigns the true class the highest probability (even if the probability values themselves are wrong)
  - For many tasks, NB's probability ordering is correct
Empirical evidence: Text classification (highly correlated features) shows NB competitive with complex models, validating these theoretical insights.
References
- Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3), 103-130.
- McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. AAAI Workshop on Learning for Text Categorization.
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. [Chapter 13: Text Classification]
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. [Section 8.2: Naive Bayes]
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Chapter 3: Generative Models for Discrete Data]
Naive Bayes, with its minimalist form and surprisingly good performance, is a classic algorithm for machine learning beginners. From spam filtering to sentiment analysis, from medical diagnosis to recommendation systems, Naive Bayes has proven the wisdom of "simplicity is the ultimate sophistication" in countless applications. Understanding Naive Bayes is not only the starting point for probabilistic classification, but also the foundation for Bayesian networks, Hidden Markov Models, and other advanced probabilistic graphical models.
- Post title: Machine Learning Mathematical Derivations (9): Naive Bayes
- Post author: Chen Kai
- Create time: 2021-10-12 09:15:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-9-Naive-Bayes/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless otherwise stated.