Naive Bayes's conditional independence assumption is too strict; real-world features often have complex dependencies. How can we relax the independence assumption while maintaining computational efficiency and learning more accurate probabilistic models? Semi-Naive Bayes provides an elegant answer — by introducing limited attribute dependencies, it achieves balance between model complexity and expressive power. This chapter delves into SPODE, TAN, AODE and other semi-naive Bayes models, and introduces Bayesian network structure learning and parameter estimation.
Attribute Dependence and Independence Relaxation
Limitations of Naive Bayes
Conditional Independence Assumption:
$$
P(\mathbf{x} \mid Y=c_k) = \prod_{j=1}^{d} P(x^{(j)} \mid Y=c_k)
$$
Problem: Real-world features are often correlated:
- Text: "machine" and "learning" frequently co-occur
- Medical: "fever" and "cough" are correlated in flu diagnosis
- Images: adjacent pixels are highly correlated

Consequence: When the conditional independence assumption is severely violated, posterior probability estimates are heavily biased and classification performance degrades.
Independence Relaxation Strategies
Full Joint Distribution: No independence assumption:
$$
P(\mathbf{x} \mid Y=c_k) = P(x^{(1)}, \ldots, x^{(d)} \mid Y=c_k)
$$
Parameter count: grows exponentially with the number of attributes $d$ (on the order of $K n^d$ for $K$ classes and $n$-valued attributes), which is infeasible to estimate from realistic sample sizes.
Semi-Naive Bayes: Introduce limited dependencies, balancing between expressive power and computational complexity.
Core idea: Assume each attribute depends on at most a small number of other attributes besides the class (e.g., a single parent attribute).
The following figure compares Naive Bayes, TAN, and AODE structures:
One-Dependence Estimators (ODE)
SPODE: Super-Parent One-Dependence Estimator
Assumption: Besides the class, all attributes depend on one common attribute $x^{(p)}$, called the super-parent:
$$
P(\mathbf{x} \mid Y=c_k) = P(x^{(p)} \mid Y=c_k) \prod_{j \neq p} P(x^{(j)} \mid Y=c_k, x^{(p)})
$$
Parameter estimation: as in Naive Bayes, the conditional probabilities are estimated by smoothed frequency counts:
$$
\hat{P}(x^{(j)} \mid Y=c_k, x^{(p)}) = \frac{N(Y=c_k,\, x^{(p)},\, x^{(j)}) + 1}{N(Y=c_k,\, x^{(p)}) + n_j}
$$
where $N(\cdot)$ counts matching training samples and $n_j$ is the number of values attribute $j$ can take.
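SPODE's conditional probabilities can be estimated by smoothed frequency counting; a minimal Python sketch on a toy dataset (the data, attribute indices, and super-parent choice are illustrative, not from the text):

```python
# Toy dataset of (x1, x2, y) rows; attribute 0 (x1) plays the super-parent.
data = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1)]

def spode_cond_prob(data, j, v_j, y, v_p, n_j=2):
    """Laplace-smoothed estimate of P(x^(j) = v_j | Y = y, x^(p) = v_p), with p = 0."""
    joint = sum(1 for r in data if r[j] == v_j and r[2] == y and r[0] == v_p)
    cond = sum(1 for r in data if r[2] == y and r[0] == v_p)
    return (joint + 1) / (cond + n_j)  # add-one smoothing over the n_j values of x^(j)

p = spode_cond_prob(data, 1, 1, 0, 0)  # estimate of P(x2 = 1 | y = 0, x1 = 0)
print(p)
```

With two matching rows for the conditioning event and one joint match, the smoothed estimate is (1+1)/(2+2).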
Super-parent selection:
1. Mutual information criterion: select the attribute with maximum mutual information with the class:
$$
p^{*} = \arg\max_{p} I(X^{(p)}; Y)
$$
2. Conditional mutual information criterion: select the attribute that, once conditioned on, keeps the remaining attributes most informative about the class:
$$
p^{*} = \arg\max_{p} \sum_{j \neq p} I(X^{(j)}; Y \mid X^{(p)})
$$
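The first criterion can be sketched directly from empirical counts; the toy dataset and attribute names below are illustrative:

```python
import math
from collections import Counter

data = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1)]  # (x1, x2, y)

def mutual_info(data, j):
    """Empirical I(X^(j); Y) computed from joint and marginal counts."""
    n = len(data)
    pj = Counter(r[j] for r in data)
    py = Counter(r[2] for r in data)
    pjy = Counter((r[j], r[2]) for r in data)
    mi = 0.0
    for (xv, yv), c in pjy.items():
        mi += (c / n) * math.log((c / n) / ((pj[xv] / n) * (py[yv] / n)))
    return mi

super_parent = max(range(2), key=lambda j: mutual_info(data, j))
print(super_parent)
```

On this toy data attribute 0 carries positive mutual information with the class while attribute 1 carries none, so attribute 0 is chosen as super-parent.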
AODE: Averaged One-Dependence Estimators
Idea: Don't select single super-parent, but average over all possible super-parent SPODE models, reducing selection bias.
Model:
$$
P(Y=c_k \mid \mathbf{x}) \propto P(Y=c_k) \sum_{i:\, n_i \geq m} P(x^{(i)} \mid Y=c_k) \prod_{j \neq i} P(x^{(j)} \mid Y=c_k, x^{(i)})
$$
where $n_i$ is the number of training samples whose attribute $i$ takes the value $x^{(i)}$, and $m$ is a frequency threshold (commonly set to 30) that excludes super-parents with too little supporting data.
Classification rule:
$$
y = \arg\max_{c_k} P(Y=c_k) \sum_{i:\, n_i \geq m} P(x^{(i)} \mid Y=c_k) \prod_{j \neq i} P(x^{(j)} \mid Y=c_k, x^{(i)})
$$
Advantages:
1. No feature selection needed: automatically averages multiple models, strong robustness
2. Stable performance: usually outperforms single SPODE and Naive Bayes
3. Easy to implement: same parameter estimation as SPODE
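A minimal sketch of the averaged score on a toy dataset (plain frequency estimates, no smoothing, and threshold m = 1 so every super-parent qualifies; all data and names are illustrative):

```python
# Toy dataset of (x1, x2, y) rows.
data = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
d, n = 2, len(data)

def aode_score(x, c, m=1):
    """Unnormalized P(Y=c | x): sum over admissible super-parents i of
    P(c) * P(x_i | c) * prod_{j != i} P(x_j | c, x_i)."""
    rows_c = [r for r in data if r[2] == c]
    total = 0.0
    for i in range(d):
        if sum(1 for r in data if r[i] == x[i]) < m:
            continue  # skip super-parent values with too little support
        rows_ci = [r for r in rows_c if r[i] == x[i]]
        if not rows_c or not rows_ci:
            continue
        s = (len(rows_c) / n) * (len(rows_ci) / len(rows_c))
        for j in range(d):
            if j != i:
                s *= sum(1 for r in rows_ci if r[j] == x[j]) / len(rows_ci)
        total += s
    return total

pred = max([0, 1], key=lambda c: aode_score((0, 0), c))
print(pred)
```

For the query (0, 0) both super-parent terms support class 0 and neither supports class 1, so the averaged score picks class 0.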
TAN: Tree Augmented Naive Bayes
Maximum Weighted Spanning Tree
Idea: Use tree structure to represent attribute dependencies, relaxing independence while maintaining inference efficiency.
Model structure:
- The class $Y$ is a parent of every attribute.
- Each attribute $x^{(j)}$ has at most one additional attribute parent $x^{(\pi_j)}$, and the attribute parents form a tree.
$$
P(\mathbf{x} \mid Y=c_k) = \prod_{j=1}^{d} P(x^{(j)} \mid Y=c_k, x^{(\pi_j)})
$$
where $\pi_j$ denotes the attribute parent of $x^{(j)}$ in the tree (the root attribute conditions only on the class).
Conditional Mutual Information and Tree Learning
Objective: Maximize the likelihood of the data under the tree-structure constraint.
Equivalent problem: Maximize the sum of conditional mutual information between attributes connected by tree edges.
Conditional Mutual Information:
$$
I(X^{(j)}; X^{(k)} \mid Y) = \sum_{y,\, x_j,\, x_k} P(x_j, x_k, y) \log \frac{P(x_j, x_k \mid y)}{P(x_j \mid y)\, P(x_k \mid y)}
$$
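This quantity can be estimated from counts; a short sketch on a toy dataset (data and indices illustrative):

```python
import math
from collections import Counter

data = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1)]  # (x1, x2, y)

def cond_mutual_info(data, j, k, y_idx=2):
    """Empirical I(X^(j); X^(k) | Y) from a list of rows."""
    n = len(data)
    c_jky = Counter((r[j], r[k], r[y_idx]) for r in data)
    c_jy = Counter((r[j], r[y_idx]) for r in data)
    c_ky = Counter((r[k], r[y_idx]) for r in data)
    c_y = Counter(r[y_idx] for r in data)
    cmi = 0.0
    for (xj, xk, y), c in c_jky.items():
        # P(xj,xk,y) * log[ P(xj,xk|y) / (P(xj|y) P(xk|y)) ], rewritten in counts
        cmi += (c / n) * math.log((c * c_y[y]) / (c_jy[(xj, y)] * c_ky[(xk, y)]))
    return cmi

print(cond_mutual_info(data, 0, 1))
```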
Algorithm (Chow-Liu algorithm):
1. Construct complete graph: nodes are all attributes.
2. Compute edge weights: each edge $(X^{(j)}, X^{(k)})$ has weight $I(X^{(j)}; X^{(k)} \mid Y)$.
3. Maximum spanning tree: use Kruskal's or Prim's algorithm to find the maximum weighted spanning tree.
4. Select root: choose any attribute as the root (usually one with maximum mutual information with the class).
5. Orient: starting from the root, convert the undirected tree into a directed tree.
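Steps 3-5 can be sketched with Prim's greedy algorithm: given a symmetric matrix of conditional-mutual-information weights (the numbers below are illustrative), grow the tree from a root and record each node's parent, which simultaneously orients the tree:

```python
# Illustrative 4-attribute weight matrix of conditional mutual informations.
W = [[0.00, 0.30, 0.05, 0.02],
     [0.30, 0.00, 0.20, 0.01],
     [0.05, 0.20, 0.00, 0.25],
     [0.02, 0.01, 0.25, 0.00]]

def max_spanning_tree(W, root=0):
    """Prim's algorithm for the MAXIMUM weighted spanning tree, oriented from root."""
    d = len(W)
    in_tree = {root}
    parent = {root: None}  # parent[child] = parent attribute (None for the root)
    while len(in_tree) < d:
        # greedily take the heaviest edge leaving the current tree
        u, v = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: W[e[0]][e[1]])
        in_tree.add(v)
        parent[v] = u
    return parent

parents = max_spanning_tree(W)
print(parents)
```

On these weights the chain 0 → 1 → 2 → 3 is selected: the heaviest available edge is taken at each step.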
Theorem (Friedman et al., 1997): Given tree structure constraint, TAN is the maximum likelihood estimator.
Bayesian Networks Introduction
Directed Acyclic Graphs and Conditional Independence
The following figures show Bayesian network examples and independence types:
A Bayesian Network is a directed acyclic graph (DAG) whose nodes are random variables and whose directed edges represent direct dependencies; each node stores a conditional probability distribution given its parents.
Joint distribution decomposition:
$$
P(X_1, \ldots, X_d, Y) = \prod_{i=1}^{d+1} P(X_i \mid \pi(X_i))
$$
where $\pi(X_i)$ denotes the set of parents of node $X_i$ in the graph (taking $X_{d+1} = Y$).
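The chain-rule factorization over parents can be checked on a tiny hypothetical network Y → X1 → X2 (all probability values below are illustrative):

```python
# CPTs for a hypothetical DAG Y -> X1 -> X2:
# P(X1, X2, Y) = P(Y) * P(X1 | Y) * P(X2 | X1)
P_Y = {0: 0.6, 1: 0.4}
P_X1_given_Y = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # keys: (x1, y)
P_X2_given_X1 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # keys: (x2, x1)

def joint(x1, x2, y):
    return P_Y[y] * P_X1_given_Y[(x1, y)] * P_X2_given_X1[(x2, x1)]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # a valid factorization sums to 1 over all configurations
```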

Conditional independence: In a Bayesian network, every node is conditionally independent of its non-descendants given its parents (the local Markov property).
d-separation: Graphical criterion for judging conditional independence in Bayesian networks.

Structure Learning: Score-Search Methods
Objective: Learn the network structure that best explains the data by maximizing a scoring function over candidate DAGs.
Scoring Functions:
BIC Score (Bayesian Information Criterion):
$$
\mathrm{BIC}(G \mid D) = \log P(D \mid G, \hat{\theta}_G) - \frac{\log N}{2}\, |G|
$$
where:
- $\hat{\theta}_G$ is the maximum likelihood estimate of the parameters under structure $G$
- $N$ is the number of training samples
- $|G|$ is the number of independent parameters in the network
Interpretation: BIC balances goodness of fit and model complexity, avoiding overfitting.
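A tiny numerical illustration of this trade-off (the log-likelihoods and parameter counts below are made-up values, not from the text):

```python
import math

def bic_score(log_likelihood, num_params, n_samples):
    """BIC = log-likelihood minus (log N / 2) times the parameter count."""
    return log_likelihood - 0.5 * math.log(n_samples) * num_params

# A denser structure fits the data better but pays a larger complexity penalty:
sparse = bic_score(log_likelihood=-520.0, num_params=12, n_samples=1000)
dense = bic_score(log_likelihood=-505.0, num_params=40, n_samples=1000)
print(sparse > dense)  # here the penalty outweighs the fit gain
```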
Search Algorithms
Hill Climbing
- Initialize: start from an empty or random graph
- Neighborhood operations (keeping the graph acyclic):
  - Add edge
  - Delete edge
  - Reverse edge
- Greedy selection: choose the operation with the maximum score improvement
- Terminate: stop when no operation improves the score
Problem: Easily trapped in local optima.
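The greedy loop can be sketched compactly. The scoring function below is a toy stand-in that rewards a fixed "true" edge set (a real learner would plug in BIC); the node count and target structure are illustrative:

```python
d = 3
TARGET = frozenset({(0, 1), (1, 2)})  # hypothetical best structure

def score(edges):
    return len(edges & TARGET) - len(edges - TARGET)

def has_path(edges, src, dst):
    """Is there a directed path src -> ... -> dst? Used to keep graphs acyclic."""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u in seen:
            continue
        seen.add(u)
        stack.extend(v for a, v in edges if a == u)
    return False

def neighbors(edges):
    """All DAGs one add/delete/reverse operation away."""
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            if (i, j) in edges:
                yield edges - {(i, j)}                        # delete edge
                if not has_path(edges - {(i, j)}, i, j):
                    yield (edges - {(i, j)}) | {(j, i)}       # reverse edge
            elif not has_path(edges, j, i):
                yield edges | {(i, j)}                        # add edge

edges = frozenset()
while True:
    best = max(neighbors(edges), key=score, default=edges)
    if score(best) <= score(edges):
        break  # no operation improves the score: a (possibly local) optimum
    edges = best
print(sorted(edges))
```

On this toy score the climb happens to reach the global optimum; with a real scoring function the same loop can stall at local optima, which is why restarts or tabu lists are often added.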
Q&A Highlights
Q1: Why doesn't AODE select a single super-parent?
A: Single super-parent selection introduces bias and variance:
- Bias: if the wrong super-parent is selected, performance degrades
- Variance: the selection is unstable with small samples
AODE uses model averaging to reduce variance, similar to Bagging in ensemble learning.
Q2: Why does TAN learn tree structure independently for each class?
A: Different classes may have different attribute dependencies. For example: - Spam: "free" and "money" strongly correlated - Normal email: "meeting" and "schedule" strongly correlated
Independent learning captures class-specific dependency patterns.
Q3: How to extend to k-dependence?
A: k-Dependence Bayesian Classifier (KDB): allow each attribute to depend on at most $k$ other attributes in addition to the class. Naive Bayes corresponds to $k=0$, and SPODE/TAN to $k=1$.
Larger $k$ gives stronger expressive power, but the parameter count grows rapidly and reliable estimation demands more data, increasing the risk of overfitting.
Q4: Difficulty of Bayesian network structure learning?
A: Structure learning is NP-hard (proof: Chickering et al., 2004).
Reasons:
- Search space: the number of DAGs over $d$ nodes grows super-exponentially with $d$
Practical methods:
- Constrain the search space (fixed ordering, limit parent count)
- Heuristic search (hill climbing, simulated annealing, genetic algorithms)
- Constraint-based methods (e.g., the PC algorithm) and score-based equivalence-class search (e.g., GES)
Q5: How to handle continuous features?
A:
1. Discretization: equal-width, equal-frequency, or entropy-based discretization
2. Gaussian assumption: conditional Gaussian Bayesian network (parameters are means and covariances)
3. Mixed models: mixed discrete + continuous distributions
Q6: Can Bayesian networks do causal inference?
A: Directed edges don't necessarily represent causal relationships (they may reflect mere correlation). Causal inference requires:
- Intervention (do-calculus): Pearl's causal graph framework
- Counterfactual reasoning: potential outcomes framework
Standard Bayesian networks support only associational inference; causal conclusions require additional assumptions (e.g., instrumental variables, the backdoor criterion).
✏️ Exercises and Solutions
Exercise 1: SPODE Parameter Estimation
Problem: Dataset with 100 samples, class
Solution:
Exercise 2: TAN Advantages
Problem: Compare the parameter counts of Naive Bayes and TAN for $d$ discrete attributes, each taking $n$ values, with $K$ classes.
Solution:
Naive Bayes: each attribute needs $P(x^{(j)} \mid Y=c_k)$, about $d(n-1)K$ independent parameters.
TAN: each attribute additionally conditions on its parent's value, about $d(n-1)nK$ parameters (the root attribute has no attribute parent).
TAN increases the parameter count by roughly a factor of $n$, the number of values per attribute.
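A quick numeric check of these counts for hypothetical values d = 10 attributes, n = 5 values each, K = 2 classes (the numbers are illustrative, not from the exercise):

```python
d, n, K = 10, 5, 2
nb_params = d * (n - 1) * K        # Naive Bayes: P(x_j | y), n-1 free values per (j, y)
tan_params = d * (n - 1) * n * K   # TAN: P(x_j | y, x_parent) adds a factor of n
print(nb_params, tan_params, tan_params // nb_params)
```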
Exercise 3: Conditional Mutual Information
Problem: Given class
Solution:
TAN uses the Chow-Liu maximum spanning tree algorithm, selecting the edges with the highest conditional mutual information; attributes joined by such edges should be connected first.
Higher conditional mutual information indicates stronger correlation between attributes given the class; preserving such dependencies better models the data distribution.
Exercise 4: D-Separation Judgment
Problem: In Bayesian network
Solution:
- First statement: Holds. The path contains a collider (v-structure); without observing the collider, the path is blocked.
- Second statement: Holds. Given the conditioning node, information can only flow through the collider, but the collider is still unobserved, so the path remains blocked.
- Third statement: Does NOT hold. Observing the collider "activates" the path (the explaining-away effect), making its two parents dependent.
Exercise 5: AODE vs TAN
Problem: Discuss the pros and cons of AODE vs TAN in the following scenarios: (1) training and prediction cost; (2) small sample size with high-dimensional features.
Solution:
(1) Training and prediction cost:
- TAN: must learn the maximum spanning tree: compute the $O(d^2)$ edge conditional mutual informations and run an MST algorithm. Once the structure is determined, parameter estimation is simple and prediction evaluates a single model.
- AODE: enumerates all attributes as super-parents ($d$ SPODE models), each trained independently; prediction requires evaluating all $d$ models.
(2) Small sample, high-dimensional features: overfitting risk is high.
- TAN: the tree structure constrains dependencies, keeping the parameter count relatively controlled.
- AODE: averages multiple models, which has a regularization effect and is more robust to overfitting.
In small sample scenarios, AODE more robust; in large samples, TAN more efficient.
References
- Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29(2-3), 131-163.
- Webb, G. I., Boughton, J. R., & Wang, Z. (2005). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58(1), 5-24.
- Chow, C., & Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 462-467.
- Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
- Chickering, D. M., Heckerman, D., & Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5, 1287-1330.
Semi-Naive Bayes bridges the simplicity of Naive Bayes and flexibility of Bayesian networks, achieving significant performance improvements with moderate complexity. From SPODE's super-parent dependence to TAN's tree structure learning, from AODE's model averaging to Bayesian network structure search, these methods demonstrate the powerful expressive capability and elegant mathematical structure of probabilistic graphical models. Mastering Semi-Naive Bayes is the essential path to advanced probabilistic reasoning techniques including Bayesian networks and Markov networks.
- Post title:Machine Learning Mathematical Derivations (10): Semi-Naive Bayes and Bayesian Networks
- Post author:Chen Kai
- Create time:2021-10-18 16:00:00
- Post link:https://www.chenk.top/Machine-Learning-Mathematical-Derivations-10-Semi-Naive-Bayes-and-Bayesian-Networks/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.