Machine Learning Mathematical Derivations (10): Semi-Naive Bayes and Bayesian Networks
Chen Kai

Naive Bayes's conditional independence assumption is too strict; real-world features often have complex dependencies. How can we relax the independence assumption while maintaining computational efficiency and learning more accurate probabilistic models? Semi-Naive Bayes provides an elegant answer — by introducing limited attribute dependencies, it achieves balance between model complexity and expressive power. This chapter delves into SPODE, TAN, AODE and other semi-naive Bayes models, and introduces Bayesian network structure learning and parameter estimation.

Attribute Dependence and Independence Relaxation

Limitations of Naive Bayes

Conditional Independence Assumption:

$$
P(\mathbf{x} \mid Y=c_k) = \prod_{j=1}^d P(x^{(j)} \mid Y=c_k)
$$

Problem: Real-world features are often correlated:

- Text: "machine" and "learning" frequently co-occur
- Medical: "fever" and "cough" are correlated in flu diagnosis
- Images: adjacent pixels are highly correlated

Consequence: When the conditional independence assumption is severely violated, posterior probability estimates have large bias and classification performance degrades.

Independence Relaxation Strategies

Full Joint Distribution: no independence assumption:

$$
P(\mathbf{x} \mid Y=c_k) = P(x^{(1)}, \dots, x^{(d)} \mid Y=c_k)
$$

Parameter count: $O(K \prod_{j=1}^d S_j)$ (where $S_j$ is the number of values for feature $j$ and $K$ is the number of classes), infeasible for high-dimensional data.

Semi-Naive Bayes: Introduce limited dependencies, balancing between expressive power and computational complexity.

Core idea: Assume each attribute depends on at most a few other attributes (e.g., parent, sibling nodes).

The following figure compares Naive Bayes, TAN, and AODE structures:

TAN Construction

One-Dependence Estimators (ODE)

SPODE: Super-Parent One-Dependence Estimator

Assumption: All attributes depend on the class $Y$ and a common super-parent attribute $x^{(p)}$:

$$
P(\mathbf{x} \mid Y=c_k) = P(x^{(p)} \mid Y=c_k) \prod_{j \neq p} P(x^{(j)} \mid Y=c_k, x^{(p)})
$$

Parameter estimation (with Laplace smoothing, $\lambda > 0$, attribute $j$ taking $S_j$ values):

$$
\hat{P}(x^{(j)} \mid Y=c_k, x^{(p)}) = \frac{N(x^{(j)}, c_k, x^{(p)}) + \lambda}{N(c_k, x^{(p)}) + \lambda S_j}
$$

Super-parent selection:

1. Mutual information criterion: select the attribute with maximum mutual information with the class:

$$
p^{*} = \arg\max_{p} I(X^{(p)}; Y)
$$

2. Conditional mutual information criterion: select the super-parent that best informs the remaining attributes:

$$
p^{*} = \arg\max_{p} \sum_{j \neq p} I(X^{(j)}; Y \mid X^{(p)})
$$
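The selection criterion and the smoothed estimate above can be sketched in a few lines of Python. This is a toy illustration, not from the original post; the helper names (`mutual_information`, `select_super_parent`, `cond_prob_laplace`) are hypothetical:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_super_parent(X, y):
    """First criterion: p* = argmax_p I(X^(p); Y)."""
    d = len(X[0])
    return max(range(d), key=lambda p: mutual_information([row[p] for row in X], y))

def cond_prob_laplace(n_joint, n_cond, n_values, lam=1.0):
    """Laplace-smoothed P(x_j | c_k, x_p) = (N(x_j, c_k, x_p) + lam) / (N(c_k, x_p) + lam * S_j)."""
    return (n_joint + lam) / (n_cond + lam * n_values)

# Toy data: attribute 0 matches the class perfectly, attribute 1 only weakly.
X = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]
y = [0, 0, 1, 1, 0, 1]
p = select_super_parent(X, y)  # attribute 0 has the larger I(X^(p); Y)
```

On this toy data, attribute 0 is chosen because it carries the full class information, while attribute 1's empirical mutual information with the class is much smaller.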

AODE: Averaged One-Dependence Estimators

Idea: Instead of selecting a single super-parent, average over all possible super-parent SPODE models, reducing selection bias.

Model:

$$
P(Y=c_k \mid \mathbf{x}) \propto P(Y=c_k) \sum_{i:\, n_i \geq m} P(x^{(i)} \mid Y=c_k) \prod_{j \neq i} P(x^{(j)} \mid Y=c_k, x^{(i)})
$$

where $n_i$ is the number of training samples taking attribute value $x^{(i)}$, and $m$ is a minimum sample threshold (e.g., $m = 30$), used to filter out rare attribute values.

Classification rule:

$$
\hat{y} = \arg\max_{c_k} P(Y=c_k) \sum_{i:\, n_i \geq m} P(x^{(i)} \mid Y=c_k) \prod_{j \neq i} P(x^{(j)} \mid Y=c_k, x^{(i)})
$$

Advantages:

1. No feature selection needed: automatically averages multiple models; strong robustness
2. Stable performance: usually outperforms a single SPODE and Naive Bayes
3. Easy to implement: same parameter estimation as SPODE
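The averaging rule above can be sketched directly from counts. The following is a minimal illustration (not the original post's code; `aode_score` and `aode_predict` are hypothetical names, and counts are recomputed on the fly rather than cached as a real implementation would):

```python
def aode_score(x, c, data, labels, m=1, lam=1.0):
    """Unnormalized AODE score for class c on instance x (a tuple of attribute values):
    P(c) * sum_{i: n_i >= m} P(x_i | c) * prod_{j != i} P(x_j | c, x_i),
    with Laplace smoothing (lam) throughout."""
    n, d = len(data), len(x)
    rows_c = [row for row, lab in zip(data, labels) if lab == c]
    n_c = len(rows_c)
    vals = [len({row[j] for row in data}) for j in range(d)]  # S_j per attribute
    p_c = (n_c + lam) / (n + lam * len(set(labels)))
    total = 0.0
    for i in range(d):
        n_i = sum(1 for row in data if row[i] == x[i])
        if n_i < m:
            continue  # skip super-parent values with too few samples
        rows_ci = [row for row in rows_c if row[i] == x[i]]
        term = (len(rows_ci) + lam) / (n_c + lam * vals[i])  # P(x_i | c)
        for j in range(d):
            if j != i:
                n_joint = sum(1 for row in rows_ci if row[j] == x[j])
                term *= (n_joint + lam) / (len(rows_ci) + lam * vals[j])  # P(x_j | c, x_i)
        total += term
    return p_c * total

def aode_predict(x, data, labels, m=1):
    """Pick the class with the highest AODE score."""
    return max(set(labels), key=lambda c: aode_score(x, c, data, labels, m=m))

# Toy data: attribute 0 tracks the class, attribute 1 is noise.
data = [(0, 0), (0, 1), (1, 1), (1, 0)]
labels = [0, 0, 1, 1]
```

Note that $m$ is set to 1 here so the toy dataset is not filtered away; with realistic data a larger threshold is used.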

TAN: Tree Augmented Naive Bayes

Maximum Weighted Spanning Tree

Idea: Use tree structure to represent attribute dependencies, relaxing independence while maintaining inference efficiency.

Model structure:

- Class $Y$ is the parent of all attributes
- Attributes form a directed tree (each attribute has at most one parent attribute)

$$
P(\mathbf{x} \mid Y=c_k) = \prod_{j=1}^d P(x^{(j)} \mid Y=c_k, x^{(\pi_j)})
$$

where $x^{(\pi_j)}$ is the parent attribute of $x^{(j)}$ (if $x^{(j)}$ is the root, it has no parent attribute).

Conditional Mutual Information and Tree Learning

Objective: Maximize conditional likelihood of tree structure

Equivalent problem: Maximize sum of conditional mutual information between attributes

Conditional Mutual Information:

$$
I(X^{(j)}; X^{(k)} \mid Y) = \sum_{y,\, x_j,\, x_k} P(x_j, x_k, y) \log \frac{P(x_j, x_k \mid y)}{P(x_j \mid y)\, P(x_k \mid y)}
$$

Algorithm (Chow-Liu algorithm):

1. Construct complete graph: nodes are all attributes $\{x^{(1)}, \dots, x^{(d)}\}$
2. Compute edge weights: each edge $(x^{(j)}, x^{(k)})$ has weight $I(X^{(j)}; X^{(k)} \mid Y)$
3. Maximum spanning tree: use Kruskal's or Prim's algorithm to find the maximum weighted spanning tree
4. Select root: choose any attribute as root (usually one with maximum mutual information with the class)
5. Orient: starting from the root, convert the undirected tree into a directed tree
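The steps above can be sketched as follows. This is a toy illustration with hypothetical names (`cond_mutual_information`, `tan_structure`); Prim's algorithm builds the maximum spanning tree, and the root defaults to attribute 0:

```python
from collections import Counter
import math

def cond_mutual_information(xa, xb, y):
    """Empirical I(X_a; X_b | Y) for discrete sequences."""
    n = len(y)
    pabY = Counter(zip(xa, xb, y))
    paY, pbY, pY = Counter(zip(xa, y)), Counter(zip(xb, y)), Counter(y)
    total = 0.0
    for (a, b, c), cnt in pabY.items():
        p_abc = cnt / n
        # P(a,b|c) / (P(a|c) * P(b|c)), all from counts
        ratio = (cnt / pY[c]) / ((paY[(a, c)] / pY[c]) * (pbY[(b, c)] / pY[c]))
        total += p_abc * math.log(ratio)
    return total

def tan_structure(X, y, root=0):
    """Maximum-weight spanning tree over attributes (Prim's algorithm), edges
    weighted by conditional MI; returns parent[j] per attribute (root has None)."""
    d = len(X[0])
    cols = [[row[j] for row in X] for j in range(d)]
    w = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            w[i][j] = w[j][i] = cond_mutual_information(cols[i], cols[j], y)
    parent, in_tree = [None] * d, {root}
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: w[e[0]][e[1]])
        parent[j] = i  # orient the heaviest crossing edge away from the tree
        in_tree.add(j)
    return parent

# Toy data: attribute 1 duplicates attribute 0; attribute 2 is independent of both.
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1),
     (0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
parent = tan_structure(X, y)  # attribute 1 attaches to attribute 0
```

On this toy data the edge between attributes 0 and 1 has the highest conditional mutual information, so the learned tree makes attribute 0 the parent of attribute 1.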

Theorem (Friedman et al., 1997): Under the tree-structure constraint, TAN is the maximum likelihood estimator among tree-augmented networks.

Bayesian Networks Introduction

Directed Acyclic Graphs and Conditional Independence

The following figures show Bayesian network examples and independence types:

Bayesian Network Example
Independence Types

A Bayesian Network is a directed acyclic graph (DAG), where:

- Nodes $V = \{X_1, \dots, X_d, Y\}$ represent random variables
- Directed edges represent conditional dependencies

Joint distribution decomposition:

$$
P(X_1, \dots, X_d, Y) = \prod_{i=1}^{d+1} P(X_i \mid \mathrm{Pa}(X_i))
$$

where $\mathrm{Pa}(X_i)$ is the set of parents of $X_i$ (treating $Y$ as $X_{d+1}$).

Conditional independence: In a Bayesian network, each node is conditionally independent of its non-descendants given its parents.

d-separation: Graphical criterion for judging conditional independence in Bayesian networks.

Structure Learning: Score-Search Methods

Objective: Learn the optimal network structure $\mathcal{G}$ from data

Scoring Functions:

BIC Score (Bayesian Information Criterion):

$$
\mathrm{BIC}(\mathcal{G}) = \log L(\mathcal{G} \mid D) - \frac{\log N}{2} |\mathcal{G}|
$$

where:

- $\log L(\mathcal{G} \mid D)$ is the log-likelihood of the data under structure $\mathcal{G}$
- $|\mathcal{G}|$ is the parameter count of the structure
- $\frac{\log N}{2} |\mathcal{G}|$ is the complexity penalty ($N$ is the sample size)

Interpretation: BIC balances goodness of fit and model complexity, avoiding overfitting.
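Because both the log-likelihood and the parameter count decompose over nodes, the BIC score can be computed node by node. A minimal sketch under these assumptions (discrete data, structure given as a parent map; `bic_score` is a hypothetical name):

```python
from collections import Counter
import math

def bic_score(data, structure, cardinality):
    """BIC = log-likelihood - (log N / 2) * |params|, decomposed per node.
    data: list of tuples; structure: dict node -> tuple of parent indices;
    cardinality: dict node -> number of values the variable can take."""
    n = len(data)
    score = 0.0
    for node, parents in structure.items():
        joint = Counter((row[node], tuple(row[p] for p in parents)) for row in data)
        par = Counter(tuple(row[p] for p in parents) for row in data)
        # MLE log-likelihood contribution: sum over (x, pa) of N(x,pa) * log(N(x,pa)/N(pa))
        ll = sum(cnt * math.log(cnt / par[pa]) for (x, pa), cnt in joint.items())
        # free parameters of this CPT: (|X| - 1) * prod of parent cardinalities
        k = cardinality[node] - 1
        for p in parents:
            k *= cardinality[p]
        score += ll - 0.5 * math.log(n) * k
    return score

# Toy data where variable 1 fully determines variable 0.
data = [(0, 0)] * 4 + [(1, 1)] * 4
card = {0: 2, 1: 2}
dependent = {0: (1,), 1: ()}    # edge X1 -> X0
independent = {0: (), 1: ()}    # no edges
```

On this data the dependent structure fits perfectly, so its higher likelihood outweighs the extra parameter penalty and it receives the larger BIC score.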

Search Algorithms

Hill Climbing

  1. Initialize: start from an empty or random graph
  2. Neighborhood operations:
    • Add edge: $\mathcal{G} \to \mathcal{G} \cup \{X_i \to X_j\}$
    • Delete edge: $\mathcal{G} \to \mathcal{G} \setminus \{X_i \to X_j\}$
    • Reverse edge: $\mathcal{G} \to \mathcal{G} \cup \{X_j \to X_i\} \setminus \{X_i \to X_j\}$
  3. Greedy selection: choose the operation with maximum score improvement (keeping the graph acyclic)
  4. Terminate: when no operation can improve the score

Problem: Easily trapped in local optima.

Q&A Highlights

Q1: Why doesn't AODE select a single super-parent?

A: Single super-parent selection introduces bias and variance:

- Bias: if the wrong super-parent is selected, performance degrades
- Variance: selection is unstable with small samples

AODE uses model averaging to reduce variance, similar to Bagging in ensemble learning.


Q2: Why does TAN learn tree structure independently for each class?

A: Different classes may have different attribute dependencies. For example:

- Spam: "free" and "money" are strongly correlated
- Normal email: "meeting" and "schedule" are strongly correlated

Independent learning captures class-specific dependency patterns.


Q3: How to extend to k-dependence?

A: The k-dependence Bayesian Classifier allows each attribute to depend on at most $k$ other attributes (besides the class):

- $k = 0$: Naive Bayes
- $k = 1$: ODE (SPODE, AODE, TAN)
- $k = 2$: two-dependence estimation (each conditional table grows to $O(S^3)$ entries per class)

A larger $k$ means more expressive power, but the parameter count grows exponentially in $k$.


Q4: Difficulty of Bayesian network structure learning?

A: Structure learning is NP-hard (Chickering et al., 2004). Reasons:

- Search space: the number of DAGs on $d$ nodes grows super-exponentially in $d$
- Non-convex scoring: local search is easily trapped in local optima

Practical methods:

- Constrain the search space (fixed variable ordering, limited parent count)
- Heuristic search (hill climbing, simulated annealing, genetic algorithms)
- Constraint-based methods (PC algorithm) and score-based search over equivalence classes (GES)


Q5: How to handle continuous features?

A:

1. Discretization: equal-width, equal-frequency, or entropy-based discretization
2. Gaussian assumption: conditional Gaussian Bayesian networks (parameters are means and covariances)
3. Mixed models: mixed discrete + continuous distributions


Q6: Can Bayesian networks do causal inference?

A: Directed edges don't necessarily represent causal relationships (they may capture only correlation). Causal inference requires:

- Intervention (do-calculus): Pearl's causal graph framework
- Counterfactual reasoning: the potential outcomes framework

Standard Bayesian networks support only associational inference; causal inference requires additional causal assumptions (e.g., instrumental variables, the backdoor criterion).


✏️ Exercises and Solutions

Exercise 1: SPODE Parameter Estimation

Problem: A dataset has 100 samples; class $c_1$ has 60 samples and class $c_2$ has 40. Choose $x^{(p)}$ as the super-parent; in class $c_1$ there are 30 samples with $x^{(p)} = a$, among which 18 have $x^{(j)} = b$. Using Laplace smoothing ($\lambda = 1$; $x^{(j)}$ has 3 possible values), calculate $\hat{P}(x^{(j)} = b \mid Y = c_1, x^{(p)} = a)$.

Solution:

$$
\hat{P}(x^{(j)} = b \mid Y = c_1, x^{(p)} = a) = \frac{18 + 1}{30 + 1 \times 3} = \frac{19}{33} \approx 0.576
$$
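The arithmetic can be checked in a couple of lines (a quick sanity check, not part of the original exercise):

```python
# Laplace-smoothed SPODE estimate: (N(x_j, c_1, x_p) + lam) / (N(c_1, x_p) + lam * S_j)
lam, S_j = 1, 3
p_hat = (18 + lam) / (30 + lam * S_j)
print(round(p_hat, 3))  # → 0.576
```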


Exercise 2: TAN Advantages

Problem: Compare the parameter counts of Naive Bayes and TAN. Given $d$ attributes, each with $S$ values, and $K$ classes.

Solution:

Naive Bayes: $O(KdS)$ — each attribute-class pair needs $S - 1$ free probability parameters, for $Kd(S - 1)$ in total.

TAN: $O(KdS^2)$ — each non-root attribute depends on the class plus one parent attribute, so its conditional probability table needs $KS(S - 1)$ parameters.

TAN increases the parameter count by roughly a factor of $S$, but captures inter-attribute dependencies. For small $S$ (e.g., $S = 2$ for binary features), the parameter growth is manageable.


Exercise 3: Conditional Mutual Information

Problem: Given the class $Y$, the conditional mutual information between $x^{(1)}$ and $x^{(2)}$ is $0.8$, and between $x^{(2)}$ and $x^{(3)}$ is $0.5$. Which two attributes should TAN prioritize connecting?

Solution:

TAN uses the Chow-Liu maximum spanning tree algorithm, selecting edges with the highest conditional mutual information. It should prioritize connecting $x^{(1)}$ and $x^{(2)}$ (weight $0.8 > 0.5$).

Higher conditional mutual information indicates stronger correlation between attributes given the class; preserving such a dependency better models the data distribution.


Exercise 4: D-Separation Judgment

Problem: In the Bayesian network $A \to B \to C \leftarrow D$, determine whether the following conditional independencies hold: (1) $A \perp D$? (2) $A \perp D \mid B$? (3) $A \perp D \mid C$?

Solution:

  1. $A \perp D$: Holds. The path $A \to B \to C \leftarrow D$ contains the collider $C$ (a v-structure). Without observing $C$, the collider blocks the path.

  2. $A \perp D \mid B$: Holds. Given $B$, information from $A$ can only flow through $B$, but $C$ is still an unobserved collider, so the path remains blocked.

  3. $A \perp D \mid C$: Does NOT hold. Observing the collider $C$ "activates" the path (the explaining-away effect), making $A$ and $D$ dependent.


Exercise 5: AODE vs TAN

Problem: Discuss the pros and cons of AODE vs. TAN in two scenarios: (1) a large number of attributes $d$; (2) a small number of training samples $N$.

Solution:

  1. Large $d$:
  • TAN: must learn a maximum spanning tree — compute $O(d^2)$ edge conditional mutual informations and run an MST algorithm. Once the structure is determined, parameter estimation is simple.
  • AODE: enumerates all super-parents ($d$ models), each SPODE trained independently; prediction requires evaluating all $d$ models.

For large $d$, AODE has the higher computational cost (prediction is $O(d^2)$ per instance), while TAN is more efficient (prediction is $O(d)$ per instance).

  2. Small $N$:
  • With few samples and many attributes, overfitting risk is high.
  • TAN: the tree structure constrains dependencies; the parameter count $O(KdS^2)$ is relatively controlled.
  • AODE: averages multiple models, which has a regularization effect and is more robust to overfitting.

In small-sample scenarios AODE is more robust; with large samples TAN is more efficient.

References

  1. Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29(2-3), 131-163.
  2. Webb, G. I., Boughton, J. R., & Wang, Z. (2005). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58(1), 5-24.
  3. Chow, C., & Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 462-467.
  4. Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
  5. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Semi-Naive Bayes bridges the simplicity of Naive Bayes and flexibility of Bayesian networks, achieving significant performance improvements with moderate complexity. From SPODE's super-parent dependence to TAN's tree structure learning, from AODE's model averaging to Bayesian network structure search, these methods demonstrate the powerful expressive capability and elegant mathematical structure of probabilistic graphical models. Mastering Semi-Naive Bayes is the essential path to advanced probabilistic reasoning techniques including Bayesian networks and Markov networks.

  • Post title: Machine Learning Mathematical Derivations (10): Semi-Naive Bayes and Bayesian Networks
  • Post author: Chen Kai
  • Create time: 2021-10-18 16:00:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-10-Semi-Naive-Bayes-and-Bayesian-Networks/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stating additionally.