Machine Learning Mathematical Derivations (9): Naive Bayes
Chen Kai BOSS

Naive Bayes is the simplest yet most elegant probabilistic classifier — based on Bayes' theorem and the conditional independence assumption, it decomposes complex joint probabilities into simple products of conditional probabilities, enabling efficient classification learning. Despite the "naive" assumption often not holding in reality, Naive Bayes shows remarkable effectiveness in text classification, spam filtering, and sentiment analysis. This chapter systematically derives the theoretical foundations, parameter estimation methods, smoothing techniques, and performance analysis of Naive Bayes.

Bayesian Decision Theory Foundations

Bayes' Theorem and Posterior Probability

Bayes' Theorem is the core formula of probability theory, describing how to update prior beliefs through observed data:

$$P(c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c)\, P(c)}{P(\mathbf{x})}$$

where:

  • $P(c \mid \mathbf{x})$: Posterior probability, the probability of class $c$ after observing $\mathbf{x}$
  • $P(\mathbf{x} \mid c)$: Class-conditional likelihood, the probability of observing $\mathbf{x}$ given class $c$
  • $P(c)$: Prior probability, the prior distribution of class $c$
  • $P(\mathbf{x})$: Evidence, the marginal probability of $\mathbf{x}$

Bayes Optimal Classifier

For classification tasks, Bayesian decision theory gives the optimal decision rule: select the class with maximum posterior probability:

$$h^*(\mathbf{x}) = \arg\max_{c} P(c \mid \mathbf{x})$$

Since the denominator $P(\mathbf{x})$ is the same for all classes, this is equivalent to:

$$h^*(\mathbf{x}) = \arg\max_{c} P(\mathbf{x} \mid c)\, P(c)$$

This is the Maximum A Posteriori (MAP) criterion.

Theoretical guarantee: the Bayes optimal classifier minimizes the classification error rate:

$$R(h^*) = 1 - \mathbb{E}_{\mathbf{x}}\left[\max_{c} P(c \mid \mathbf{x})\right]$$

This is the minimum error any classifier can achieve (the Bayes Error).
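As a quick sanity check, the Bayes error can be computed numerically for a simple one-dimensional problem; the two Gaussian class densities below are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Two equal-prior classes with 1D Gaussian likelihoods N(0,1) and N(2,1)
# (an illustrative choice). The Bayes error is the expected mass of the
# non-maximal class: the integral of min(0.5*p0(x), 0.5*p1(x)) over x.
x = np.linspace(-10.0, 12.0, 200_001)
dx = x[1] - x[0]
p0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
p1 = np.exp(-(x - 2)**2 / 2) / np.sqrt(2 * np.pi)

bayes_error = np.sum(np.minimum(0.5 * p0, 0.5 * p1)) * dx
print(bayes_error)  # ≈ 0.1587 — no classifier can beat this on this problem
```

The numerical value agrees with the closed form $\Phi(-1)$ for this pair of densities, since the optimal decision boundary sits halfway between the two means.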

Discriminative vs Generative Models

Discriminative Models: Directly learn $P(c \mid \mathbf{x})$

  • Examples: Logistic regression, SVMs
  • Advantages: Focus on the decision boundary, often better performance
  • Disadvantages: Cannot generate samples, cannot handle partial observations

Generative Models: Learn the joint distribution $P(\mathbf{x}, c)$, then obtain $P(c \mid \mathbf{x})$ via Bayes' theorem

  • Examples: Naive Bayes, Hidden Markov Models
  • Advantages: Can generate samples, handle missing data, incorporate prior knowledge
  • Disadvantages: Require more assumptions, may have bias

Naive Bayes is a typical generative model.

The following figure visualizes Bayes' theorem and conditional independence:

Bayes Theorem Visualization

Naive Bayes Classifier

Conditional Independence Assumption

For high-dimensional features, the number of parameters of the class-conditional probability $P(\mathbf{x} \mid c)$ grows exponentially with dimension (e.g., $O(2^d)$ for $d$ binary features), making estimation difficult.

Naive Bayes Assumption: Given the class $c$, all features are mutually independent:

$$P(\mathbf{x} \mid c) = \prod_{j=1}^{d} P(x^{(j)} \mid c)$$

Intuitive explanation: The class is the "cause," the features are "effects." Once the cause is determined, each "effect" is assumed to occur independently.

Naive Bayes Classification Rule

Combining Bayes' theorem and the conditional independence assumption:

$$P(c \mid \mathbf{x}) \propto P(c) \prod_{j=1}^{d} P(x^{(j)} \mid c)$$

Classifier:

$$h(\mathbf{x}) = \arg\max_{c} P(c) \prod_{j=1}^{d} P(x^{(j)} \mid c)$$

To avoid numerical underflow, use log probabilities in practice:

$$h(\mathbf{x}) = \arg\max_{c} \left[ \log P(c) + \sum_{j=1}^{d} \log P(x^{(j)} \mid c) \right]$$
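A minimal numpy sketch of why the log-space form is needed; the per-feature likelihood values are made up for illustration:

```python
import numpy as np

# 2000 per-feature likelihoods around 1e-3, as might arise for a long
# document: their raw product underflows float64 to exactly 0.0, while
# the log-space sum stays perfectly representable.
probs = np.full(2000, 1e-3)

raw_product = np.prod(probs)       # underflows below ~1e-308 and becomes 0.0
log_score = np.sum(np.log(probs))  # 2000 * log(1e-3) ≈ -13815.5

print(raw_product)  # 0.0
print(log_score)
```

Once every class score is 0.0, the arg max is meaningless; summing logs preserves the ranking exactly.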

Parameter Estimation: Maximum Likelihood

Given a training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, we need to estimate:

  • Prior probability: $P(c)$
  • Class-conditional probability: $P(x^{(j)} \mid c)$

Prior probability estimate:

$$\hat{P}(c) = \frac{N_c}{N}$$

where $N_c$ is the number of samples in class $c$.

Class-conditional probability estimates depend on feature type:

Discrete Features

For a discrete feature $x^{(j)}$ with value set $\{v_1, \ldots, v_{S_j}\}$:

$$\hat{P}(x^{(j)} = v \mid c) = \frac{N_{c,j,v}}{N_c}$$

i.e., the proportion of samples in class $c$ where feature $j$ takes value $v$.

Continuous Features

Assume the class-conditional probability follows a Gaussian distribution:

$$P(x^{(j)} \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,j}} \exp\left( -\frac{(x^{(j)} - \mu_{c,j})^2}{2\sigma_{c,j}^2} \right)$$

Parameter estimates:

$$\hat{\mu}_{c,j} = \frac{1}{N_c} \sum_{i:\, y_i = c} x_i^{(j)}, \qquad \hat{\sigma}_{c,j}^2 = \frac{1}{N_c} \sum_{i:\, y_i = c} \left( x_i^{(j)} - \hat{\mu}_{c,j} \right)^2$$

Laplace Smoothing

Zero probability problem: If some class $c$ never co-occurs with feature value $v$ in the training set, then $\hat{P}(x^{(j)} = v \mid c) = 0$, making the entire posterior probability zero and preventing classification.

Laplace Smoothing: Add a pseudocount $\alpha$ (usually $\alpha = 1$) to each count:

$$\hat{P}(x^{(j)} = v \mid c) = \frac{N_{c,j,v} + \alpha}{N_c + \alpha S_j}$$

where $S_j$ is the number of values for feature $j$.

Prior probability smoothing:

$$\hat{P}(c) = \frac{N_c + \alpha}{N + \alpha K}$$

where $K$ is the number of classes.

Bayesian interpretation: Laplace smoothing is equivalent to the Bayesian posterior-mean estimate under a symmetric Dirichlet prior $\text{Dir}(\alpha, \ldots, \alpha)$ on the multinomial distribution.
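The smoothing formula can be sketched in a few lines of numpy; the toy counts below are assumed purely for illustration:

```python
import numpy as np

def laplace_smooth(counts, alpha=1.0):
    """Add-alpha smoothed probability estimates for one discrete feature."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

counts = np.array([5, 3, 0, 0])    # two feature values never observed
mle = counts / counts.sum()        # MLE gives the unseen values probability 0
smoothed = laplace_smooth(counts)  # smoothing keeps every value possible

print(mle)       # unseen values get exactly 0 mass
print(smoothed)  # 1/2, 1/3, 1/12, 1/12 — still sums to 1
```

Note that the smoothed vector remains a valid distribution: the pseudocounts shift mass toward uniform without breaking normalization.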

The effect of Laplace smoothing is demonstrated below:

Laplace Smoothing

The following figure compares Naive Bayes (diagonal covariance) with Full Bayes (correlated features):

Naive vs Full Bayes

Three Types of Naive Bayes Models

Multinomial Naive Bayes

Suitable for discrete count features, like word frequencies in text data.

Model: Assume the feature vector $\mathbf{x}$ is a word frequency vector, where $x^{(j)}$ is the count of word $j$. The class-conditional probability is multinomial:

$$P(\mathbf{x} \mid c) \propto \prod_{j=1}^{V} \theta_{c,j}^{\,x^{(j)}}$$

where $\theta_{c,j} = P(\text{word } j \mid c)$ and $\sum_{j=1}^{V} \theta_{c,j} = 1$.

Parameter estimation:

$$\hat{\theta}_{c,j} = \frac{\sum_{i:\, y_i = c} x_i^{(j)}}{\sum_{j'=1}^{V} \sum_{i:\, y_i = c} x_i^{(j')}}$$

i.e., word $j$'s frequency as a proportion of the total word frequency in class $c$.
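A small numeric sketch of this estimate; the toy count matrix is an assumption chosen only for illustration:

```python
import numpy as np

# Word-count matrix for the documents of ONE class: 3 documents, 4 words.
X_c = np.array([[2, 1, 0, 0],
                [3, 0, 1, 0],
                [1, 1, 0, 0]])

word_counts = X_c.sum(axis=0)                # per-word frequency in the class
theta_mle = word_counts / word_counts.sum()  # proportion of total frequency

print(theta_mle)  # word 4 never appears, so its MLE is exactly 0
```

The zero estimate for the unseen word is exactly the problem Laplace smoothing fixes in the previous section.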

The following figure illustrates Gaussian Naive Bayes with class-conditional densities:

Gaussian Naive Bayes

The figure below shows text classification with Multinomial Naive Bayes:

Text Classification

Bernoulli Naive Bayes

Suitable for binary features, like presence/absence of words in documents.

Model: $x^{(j)} \in \{0, 1\}$, and the class-conditional probability is Bernoulli:

$$P(\mathbf{x} \mid c) = \prod_{j=1}^{d} \theta_{c,j}^{\,x^{(j)}} (1 - \theta_{c,j})^{1 - x^{(j)}}$$

where $\theta_{c,j} = P(x^{(j)} = 1 \mid c)$.

Difference from Multinomial NB:

  • Multinomial NB considers word frequency (how many times a word appears)
  • Bernoulli NB only considers word presence (whether it appears at all)

For short texts, Bernoulli NB may be better; for long texts, Multinomial NB is usually better.

Gaussian Naive Bayes

Suitable for continuous features, assuming class-conditional probability is Gaussian (see above).

Applications: Sensor data, biomedical features, etc.

Complete Implementation

```python
import numpy as np
from collections import defaultdict


class NaiveBayes:
    def __init__(self, alpha=1.0):
        """
        Parameters:
            alpha: Laplace smoothing parameter
        """
        self.alpha = alpha
        self.classes = None
        self.class_prior = {}
        self.feature_prob = defaultdict(lambda: defaultdict(dict))

    def fit(self, X, y):
        """
        Train Naive Bayes classifier

        Parameters:
            X: ndarray, shape (N, d), discrete features
            y: ndarray, shape (N,), class labels
        """
        N, d = X.shape
        self.classes = np.unique(y)
        K = len(self.classes)

        # Compute prior probabilities (Laplace smoothed)
        for c in self.classes:
            N_c = np.sum(y == c)
            self.class_prior[c] = (N_c + self.alpha) / (N + K * self.alpha)

        # Compute class-conditional probabilities
        for j in range(d):
            values = np.unique(X[:, j])
            S_j = len(values)

            for c in self.classes:
                X_c = X[y == c]
                N_c = len(X_c)

                for val in values:
                    count = np.sum(X_c[:, j] == val)
                    # Laplace smoothing
                    self.feature_prob[j][c][val] = (count + self.alpha) / (N_c + S_j * self.alpha)

        return self

    def predict_log_proba(self, X):
        """Compute log posterior probabilities"""
        N = X.shape[0]
        K = len(self.classes)
        log_proba = np.zeros((N, K))

        for idx, c in enumerate(self.classes):
            # Log prior
            log_proba[:, idx] = np.log(self.class_prior[c])

            # Log class-conditional; feature values unseen in training are skipped
            for j in range(X.shape[1]):
                for i in range(N):
                    val = X[i, j]
                    if val in self.feature_prob[j][c]:
                        log_proba[i, idx] += np.log(self.feature_prob[j][c][val])

        return log_proba

    def predict(self, X):
        """Predict classes"""
        log_proba = self.predict_log_proba(X)
        return self.classes[np.argmax(log_proba, axis=1)]

    def score(self, X, y):
        """Compute accuracy"""
        return np.mean(self.predict(X) == y)


class GaussianNB:
    """Gaussian Naive Bayes (continuous features)"""

    def __init__(self):
        self.classes = None
        self.class_prior = {}
        self.theta = {}  # Means
        self.sigma = {}  # Standard deviations

    def fit(self, X, y):
        N, d = X.shape
        self.classes = np.unique(y)

        for c in self.classes:
            X_c = X[y == c]
            self.class_prior[c] = len(X_c) / N
            self.theta[c] = np.mean(X_c, axis=0)
            self.sigma[c] = np.std(X_c, axis=0) + 1e-9  # Avoid division by zero

        return self

    def _gaussian_prob(self, x, mean, std):
        """Gaussian probability density"""
        exponent = -((x - mean) ** 2) / (2 * std ** 2)
        return np.exp(exponent) / (np.sqrt(2 * np.pi) * std)

    def predict_log_proba(self, X):
        N = X.shape[0]
        K = len(self.classes)
        log_proba = np.zeros((N, K))

        for idx, c in enumerate(self.classes):
            log_proba[:, idx] = np.log(self.class_prior[c])
            for j in range(X.shape[1]):
                prob = self._gaussian_prob(X[:, j], self.theta[c][j], self.sigma[c][j])
                log_proba[:, idx] += np.log(prob + 1e-9)

        return log_proba

    def predict(self, X):
        log_proba = self.predict_log_proba(X)
        return self.classes[np.argmax(log_proba, axis=1)]

    def score(self, X, y):
        return np.mean(self.predict(X) == y)


class MultinomialNB:
    """Multinomial Naive Bayes (word frequency features)"""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.classes = None
        self.class_prior = {}
        self.feature_log_prob = {}

    def fit(self, X, y):
        """
        Parameters:
            X: ndarray, shape (N, d), word frequency matrix
            y: ndarray, shape (N,), class labels
        """
        N, d = X.shape
        self.classes = np.unique(y)

        for c in self.classes:
            X_c = X[y == c]
            N_c = len(X_c)

            # Prior probability
            self.class_prior[c] = N_c / N

            # Class-conditional probability (word probability distribution)
            word_counts = np.sum(X_c, axis=0)  # Total frequency of each word
            total_count = np.sum(word_counts)  # Total words

            # Laplace smoothing
            self.feature_log_prob[c] = np.log((word_counts + self.alpha) / (total_count + d * self.alpha))

        return self

    def predict_log_proba(self, X):
        N = X.shape[0]
        K = len(self.classes)
        log_proba = np.zeros((N, K))

        for idx, c in enumerate(self.classes):
            # Log prior
            log_proba[:, idx] = np.log(self.class_prior[c])
            # Log likelihood: sum_j x^(j) * log(theta_j)
            log_proba[:, idx] += X @ self.feature_log_prob[c]

        return log_proba

    def predict(self, X):
        log_proba = self.predict_log_proba(X)
        return self.classes[np.argmax(log_proba, axis=1)]

    def score(self, X, y):
        return np.mean(self.predict(X) == y)
```

Q&A Highlights

Q1: Why called "naive" Bayes?

A: "Naive" refers to the conditional independence assumption — assuming all features are mutually independent given the class. This assumption often doesn't hold in reality (e.g., "machine" and "learning" are highly correlated in documents), but greatly simplifies computation, making the model simple and efficient.


Q2: Relationship between Naive Bayes and logistic regression?

A: Under binary classification and the Gaussian assumption (with variances shared across classes), Naive Bayes is equivalent to a special case of logistic regression:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left( -(\mathbf{w}^\top \mathbf{x} + b) \right)}$$

where the weights $\mathbf{w}$ and bias $b$ are analytically determined by the Naive Bayes parameters. Difference: Naive Bayes is generative (models the joint distribution), logistic regression is discriminative (models the conditional distribution).
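A sketch of the derivation, assuming each feature is Gaussian with class-shared variance $\sigma_j^2$ (notation follows the Gaussian NB section above):

```latex
\log \frac{P(y=1 \mid \mathbf{x})}{P(y=0 \mid \mathbf{x})}
  = \log \frac{P(y=1)}{P(y=0)}
  + \sum_{j=1}^{d} \log
      \frac{\mathcal{N}\!\left(x^{(j)};\, \mu_{1,j},\, \sigma_j^2\right)}
           {\mathcal{N}\!\left(x^{(j)};\, \mu_{0,j},\, \sigma_j^2\right)}
  = b + \sum_{j=1}^{d} \frac{\mu_{1,j} - \mu_{0,j}}{\sigma_j^2}\, x^{(j)}
```

The quadratic terms in $x^{(j)}$ cancel because the variance is shared between classes, leaving log-odds linear in $\mathbf{x}$; here $b$ collects the prior log-odds and the constants $\sum_j (\mu_{0,j}^2 - \mu_{1,j}^2)/(2\sigma_j^2)$. Applying the sigmoid to these log-odds yields exactly the logistic regression form.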


Q3: What is the essence of Laplace smoothing?

A: From the Bayesian view, add-$\alpha$ smoothing is the posterior mean under a symmetric Dirichlet prior $\text{Dir}(\alpha, \ldots, \alpha)$ on the multinomial parameters. $\alpha = 1$ corresponds to a uniform prior (Laplace smoothing); larger $\alpha$ favors smoother (closer to uniform) distributions; $\alpha < 1$ favors sparser distributions.


Q4: How to handle non-Gaussian continuous features?

A:

  1. Kernel density estimation: Non-parametric estimation of $P(x^{(j)} \mid c)$
  2. Discretization: Bin continuous features into discrete features
  3. Transformation: Log transform or Box-Cox transform to make the distribution closer to Gaussian

Q5: Can Naive Bayes output probabilities?

A: Yes, but the probability values are often inaccurate (too extreme). Reason: when the conditional independence assumption is violated, correlated evidence is counted multiple times in the log-probability sum, pushing posteriors toward 0 or 1. For accurate probabilities, consider probability calibration (Platt Scaling, Isotonic Regression).


Q6: How to choose between Multinomial and Bernoulli NB?

A:

  • Long documents: Multinomial NB (rich word frequency information)
  • Short documents: Bernoulli NB (presence more important than frequency)
  • Sparse data: Bernoulli NB (models word absence explicitly)

In practice, choose via cross-validation.


Q7: Can Naive Bayes handle missing values?

A: Yes, naturally: during prediction, ignore missing features and accumulate log-probabilities only over the observed ones:

$$\text{score}(c) = \log P(c) + \sum_{j \in \text{observed}} \log P(x^{(j)} \mid c)$$
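A minimal sketch of this idea; the log-probability values are made-up placeholders, and `np.nan` marks a feature missing at prediction time:

```python
import numpy as np

log_prior = np.log(0.4)  # log P(c), hypothetical value
log_lik = np.array([np.log(0.2), np.nan, np.log(0.5)])  # feature 2 missing

observed = ~np.isnan(log_lik)
score = log_prior + log_lik[observed].sum()  # sum only over observed features

print(score)  # the missing feature simply contributes nothing
```

Because the score is a sum of independent per-feature terms, dropping a term is equivalent to marginalizing that feature out — no imputation step is needed.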

Q8: Time complexity of Naive Bayes?

A:

  • Training: $O(Nd)$ — one pass over the data to collect count statistics
  • Prediction: $O(Kd)$ per sample ($K$ classes, $d$ features)

Much faster than SVM, neural networks; suitable for large-scale text classification.


Q10: Summary of Naive Bayes pros and cons?

A: Pros:

  • Simple, efficient, easy to implement
  • Good performance with small samples
  • Naturally handles multi-class problems
  • High interpretability
  • Supports online learning

Cons:

  • Conditional independence assumption often violated
  • Sensitive to feature correlations
  • Inaccurate probability estimates
  • Cannot learn feature interactions


✏️ Exercises and Solutions

Exercise 1: Conditional Independence Property

Problem: Suppose we are given $P(Y \mid X_1)$ together with several other probabilities. If $X_2$ is assumed conditionally independent of $Y$ given $X_1$ (i.e., $P(X_2 \mid X_1, Y) = P(X_2 \mid X_1)$), calculate $P(Y \mid X_1, X_2)$.

Solution:

Under conditional independence:

$$P(Y \mid X_1, X_2) = \frac{P(X_2 \mid X_1, Y)\, P(Y \mid X_1)}{P(X_2 \mid X_1)} = \frac{P(X_2 \mid X_1)\, P(Y \mid X_1)}{P(X_2 \mid X_1)} = P(Y \mid X_1)$$

So the answer is simply the given $P(Y \mid X_1)$: conditional independence means that once $X_1$ is known, $X_2$ is irrelevant for predicting $Y$.


Exercise 2: Laplace Smoothing Effect

Problem: In text classification with 2 classes and vocabulary size $V$, the word "excellent" appears 5 times in the positive class, whose total word count is 200. Calculate: (1) the MLE estimate without smoothing; (2) the Laplace smoothed estimate ($\alpha = 1$).

Solution:

(1) MLE without smoothing:

$$\hat{\theta}_{\text{excellent}} = \frac{5}{200} = 0.025$$

(2) Laplace smoothing ($\alpha = 1$):

$$\hat{\theta}_{\text{excellent}} = \frac{5 + 1}{200 + V}$$

Effect: For a realistically large vocabulary, smoothing reduces the probability estimate (it redistributes mass to unseen words). For unseen words ($\text{count} = 0$):

$$\hat{\theta}_{\text{unseen}} = \frac{0 + 1}{200 + V} > 0$$

This prevents the zero probability problem.
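The arithmetic can be checked numerically; since the vocabulary size is not specified above, `V = 10_000` below is a purely hypothetical choice for illustration:

```python
# Numerical check of the smoothing arithmetic; V = 10_000 is a
# hypothetical vocabulary size, not taken from the exercise.
count, total, V, alpha = 5, 200, 10_000, 1.0

mle = count / total                               # 0.025
smoothed = (count + alpha) / (total + alpha * V)  # mass shifted to unseen words
unseen = alpha / (total + alpha * V)              # nonzero even for count = 0

print(mle, smoothed, unseen)
```

With a large vocabulary the smoothed estimate drops well below the MLE, exactly the redistribution effect described above.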


Exercise 3: Multinomial vs Bernoulli

Problem: An email has 100 words: "win" appears 5 times and "free" appears 3 times. Multinomial NB uses word probabilities $\theta_{\text{win}}$ and $\theta_{\text{free}}$ for the spam class; Bernoulli NB uses presence probabilities $\theta'_{\text{win}}$ and $\theta'_{\text{free}}$. Calculate the contribution of these two words to the spam score under each model.

Solution:

Multinomial NB (word frequencies matter):

$$5 \log \theta_{\text{win}} + 3 \log \theta_{\text{free}}$$

Bernoulli NB (only presence/absence):

$$\log \theta'_{\text{win}} + \log \theta'_{\text{free}}$$

Key difference:

  • Multinomial: Repetition (5× "win") amplifies the signal
  • Bernoulli: Only cares that both words are present, ignores frequency

For spam detection with highly repetitive keywords, Multinomial NB is more suitable.


Exercise 4: Gaussian NB Parameter Estimation

Problem: 2D dataset with 6 samples: - Class +1: - Class -1:

Calculate Gaussian NB parameters for class +1.

Solution:

Prior:

$$\hat{P}(+1) = \frac{N_{+1}}{6}$$

Feature 1 parameters (class +1): the per-class sample mean and variance,

$$\hat{\mu}_{+1,1} = \frac{1}{N_{+1}} \sum_{i:\, y_i = +1} x_i^{(1)}, \qquad \hat{\sigma}_{+1,1}^2 = \frac{1}{N_{+1}} \sum_{i:\, y_i = +1} \left( x_i^{(1)} - \hat{\mu}_{+1,1} \right)^2$$

Feature 2 parameters (class +1): the same formulas applied to the second coordinate $x^{(2)}$.


Exercise 5: Why Naive Bayes Works Despite Violated Assumptions

Problem: Explain why Naive Bayes often performs well in practice even when the conditional independence assumption is violated.

Solution:

Key insights:

  1. Classification focuses on ranking, not accurate probabilities:
    • NB only needs $\arg\max_c P(c \mid \mathbf{x})$ to be correct, not the probability values to be accurate
    • Even if the estimates are biased, the class ranking may remain correct
  2. Error cancellation:
    • If all classes have similar correlation patterns, the bias affects all classes similarly
    • Relative posterior probabilities may stay correct
  3. Occam's Razor effect:
    • The simpler model (independence assumption) has lower variance
    • Bias-variance tradeoff: increased bias may be offset by reduced variance
  4. Calibration isn't classification:
    • Domingos & Pazzani (1997) proved: NB is optimal under 0-1 loss as long as it assigns the highest posterior to the true class (even if the probability values are wrong)
    • For many tasks, NB's probability ordering is correct

Empirical evidence: Text classification (highly correlated features) shows NB competitive with complex models, validating these theoretical insights.

References

  1. Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3), 103-130.
  2. McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. AAAI Workshop on Learning for Text Categorization.
  3. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. [Chapter 13: Text Classification]
  4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. [Section 8.2: Naive Bayes]
  5. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Chapter 3: Generative Models for Discrete Data]

Naive Bayes, with its minimalist form and surprisingly good performance, is a classic algorithm for machine learning beginners. From spam filtering to sentiment analysis, from medical diagnosis to recommendation systems, Naive Bayes has proven the wisdom of "simplicity is the ultimate sophistication" in countless applications. Understanding Naive Bayes is not only the starting point for probabilistic classification, but also the foundation for Bayesian networks, Hidden Markov Models, and other advanced probabilistic graphical models.

  • Create time: 2021-10-12 09:15:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-9-Naive-Bayes/