Machine Learning Mathematical Derivations (19): Neural Networks and Backpropagation
Chen Kai

Neural Networks are the cornerstone of deep learning — from biological neuron inspiration to multilayer nonlinear transformations, neural networks achieve end-to-end learning through the backpropagation algorithm. From the perceptron convergence theorem to the universal approximation theorem, from vanishing gradient problems to He initialization, from Sigmoid to ReLU, the mathematical principles of neural networks provide a solid foundation for understanding deep models. This chapter deeply derives the matrix form of forward propagation, chain rule of backpropagation, mathematical analysis of vanishing/exploding gradients, and weight initialization strategies.

Figure 5

Perceptron: The Starting Point of Neural Networks

Perceptron Model

Input: $\mathbf{x} = (x_1, \dots, x_d)^T$; weights: $\mathbf{w} = (w_1, \dots, w_d)^T$; bias: $b$; linear combination: $z = \mathbf{w}^T\mathbf{x} + b$

Activation function (step function):

$$f(z) = \operatorname{sign}(z) = \begin{cases} +1, & z \ge 0 \\ -1, & z < 0 \end{cases}$$

Geometric interpretation: the hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ divides the input space into two half-spaces.

Perceptron Learning Algorithm

Training data: $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, $y_i \in \{-1, +1\}$. Objective: find $\mathbf{w}, b$ such that $y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0$ for all $i$. Loss function: sum of (unnormalized) distances from misclassified points to the hyperplane:

$$L(\mathbf{w}, b) = -\sum_{i \in M} y_i(\mathbf{w}^T\mathbf{x}_i + b)$$

where $M$ is the set of misclassified points.

Update rule (stochastic gradient descent): select one misclassified point $(\mathbf{x}_i, y_i)$ and update

$$\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i, \qquad b \leftarrow b + \eta\, y_i$$
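As a concrete illustration, the update rule can be implemented in a few lines of NumPy (a minimal sketch; the toy AND-style data and $\eta = 1$ are choices for this example, not part of the original derivation):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning: update w, b on each misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += eta * yi * xi       # w <- w + eta * y_i * x_i
                b += eta * yi            # b <- b + eta * y_i
                errors += 1
        if errors == 0:                  # converged: every point classified correctly
            break
    return w, b

# Linearly separable toy data (AND-like labels in {-1, +1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
preds = np.sign(X @ w + b)
```

Because the data is linearly separable, Novikoff's theorem (below) guarantees the loop terminates in finitely many updates.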

Perceptron Convergence Theorem

Theorem (Novikoff, 1962): If the data is linearly separable (there exist $\mathbf{w}^*$ with $\|\mathbf{w}^*\| = 1$ and $\gamma > 0$ such that $y_i(\mathbf{w}^{*T}\mathbf{x}_i) \ge \gamma$ for all $i$), then the perceptron algorithm converges in a finite number of steps.

Conclusion: at most $(R/\gamma)^2$ updates, where $R = \max_i \|\mathbf{x}_i\|$.

Limitations of Perceptron

XOR Problem:

Data: $(0,0) \mapsto 0$, $(0,1) \mapsto 1$, $(1,0) \mapsto 1$, $(1,1) \mapsto 0$. Linearly inseparable! A single-layer perceptron cannot solve this.

Solution: Multilayer Perceptron (introduce hidden layers)

Multilayer Perceptron and Forward Propagation

MLP Architecture

Layer structure:

  • Input layer: $\mathbf{x} \in \mathbb{R}^d$ (features)
  • Hidden layer 1: $\mathbf{h}^{(1)} = \sigma(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)})$
  • Hidden layer 2: $\mathbf{h}^{(2)} = \sigma(\mathbf{W}^{(2)}\mathbf{h}^{(1)} + \mathbf{b}^{(2)})$
  • Output layer: $\hat{\mathbf{y}} = f(\mathbf{W}^{(3)}\mathbf{h}^{(2)} + \mathbf{b}^{(3)})$

Forward Propagation Derivation

Computation at layer $l$:

Linear transformation:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

where:

  • $\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$: weight matrix
  • $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$: bias vector
  • $\mathbf{a}^{(l-1)}$: activations of the previous layer, with $\mathbf{a}^{(0)} = \mathbf{x}$

Nonlinear activation:

$$\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})$$
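The layer-by-layer computation translates directly into code. A minimal NumPy sketch (the sigmoid activation and the illustrative 2→3→1 architecture are assumptions for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass: z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = sigma(z^(l)).
    Caches z and a for every layer (backpropagation needs them later)."""
    a = x
    cache = {"a0": x}
    for l, (W, b) in enumerate(params, start=1):
        z = W @ a + b              # linear transformation
        a = sigmoid(z)             # nonlinear activation
        cache[f"z{l}"], cache[f"a{l}"] = z, a
    return a, cache

rng = np.random.default_rng(0)
# 2 -> 3 -> 1 network with small random weights (illustrative sizes)
params = [(rng.normal(0, 0.1, (3, 2)), np.zeros(3)),
          (rng.normal(0, 0.1, (1, 3)), np.zeros(1))]
out, cache = forward(np.array([1.0, -1.0]), params)
```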

Activation Functions

1. Sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Derivative:

$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$

Properties:

  • Output range $(0, 1)$
  • Interpretable as a probability
  • Problem: vanishing gradients ($\sigma'(z) \le 1/4$, and $\sigma'(z) \to 0$ as $|z| \to \infty$)

2. Tanh function:

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \qquad \tanh'(z) = 1 - \tanh^2(z)$$

Properties:

  • Output range $(-1, 1)$
  • Zero-centered (better than Sigmoid)
  • Problem: still has vanishing gradients

3. ReLU (Rectified Linear Unit):

$$\mathrm{ReLU}(z) = \max(0, z)$$

Properties:

  • Simple computation
  • Mitigates vanishing gradients (gradient = 1 in the positive region)
  • Problem: Dead ReLU (a neuron stuck in the negative region never activates, so its gradient is always 0)

4. Leaky ReLU:

$$f(z) = \begin{cases} z, & z > 0 \\ \alpha z, & z \le 0 \end{cases}$$

Usually $\alpha = 0.01$. Properties: solves the Dead ReLU problem (a small nonzero gradient flows in the negative region).
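The four activations and their derivatives can be checked numerically. A small NumPy sketch (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(z):       return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):     s = sigmoid(z); return s * (1.0 - s)
def d_tanh(z):        return 1.0 - np.tanh(z) ** 2
def relu(z):          return np.maximum(0.0, z)
def d_relu(z):        return (z > 0).astype(float)
def leaky_relu(z, alpha=0.01):   return np.where(z > 0, z, alpha * z)
def d_leaky_relu(z, alpha=0.01): return np.where(z > 0, 1.0, alpha)

# sigmoid' peaks at 1/4 (at z = 0) and vanishes in the tails;
# ReLU's gradient stays exactly 1 everywhere in the positive region
peak = d_sigmoid(np.array([0.0]))[0]
tail = d_sigmoid(np.array([-5.0]))[0]
```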

Universal Approximation Theorem

Theorem (Cybenko, 1989; Hornik, 1991):

Given any continuous function $f$ on a compact set and any $\varepsilon > 0$, there exists a single-hidden-layer neural network

$$F(\mathbf{x}) = \sum_{i=1}^{N} v_i\, \sigma(\mathbf{w}_i^T \mathbf{x} + b_i)$$

such that $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x}$ in the domain. Significance: neural networks can theoretically approximate any continuous function!

But note:

  • No guarantee of efficient learning (sample complexity, training time)
  • $N$ (the number of hidden units) may be very large
  • Deep networks are more efficient in practice than wide networks

Backpropagation: The Art of the Chain Rule

Loss Functions

Regression task (mean squared error):

$$L = \frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2$$

Classification task (cross-entropy):

$$L = -\sum_{k} y_k \log \hat{y}_k$$

where $\hat{y}_k = \mathrm{softmax}(z_k) = \dfrac{e^{z_k}}{\sum_j e^{z_j}}$

Backpropagation Derivation (Output Layer)

Objective: compute $\frac{\partial L}{\partial \mathbf{W}^{(l)}}$ and $\frac{\partial L}{\partial \mathbf{b}^{(l)}}$. Define the error term:

$$\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$$

Output layer ($l = L$):

$$\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$$

For mean squared error:

$$\boldsymbol{\delta}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(L)})$$

where $\odot$ is element-wise (Hadamard) multiplication.

For Softmax + Cross-entropy (special simplification):

$$\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$$
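The Softmax + cross-entropy simplification can be verified against a central-difference numerical gradient. A quick NumPy sanity check (the logit vector is an arbitrary example):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

def ce_loss(z, y):
    """Cross-entropy of softmax(z) against one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])      # one-hot target

analytic = softmax(z) - y          # claimed simplification: dL/dz = y_hat - y
eps = 1e-6
numeric = np.array([(ce_loss(z + eps * np.eye(3)[k], y)
                     - ce_loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
                    for k in range(3)])
```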

Backpropagation Derivation (Hidden Layers)

Recursion relation:

$$\boldsymbol{\delta}^{(l)} = \left(\mathbf{W}^{(l+1)T} \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})$$

Physical meaning: the error at layer $l$ is the error from layer $l+1$ propagated backward through the weights.

Weight Gradient Computation

Weight gradient:

$$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \left(\mathbf{a}^{(l-1)}\right)^T$$

Bias gradient:

$$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
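Putting the output-layer formula, the recursion, and the two gradient formulas together, here is a minimal two-layer implementation with a numerical gradient check (sigmoid activations, MSE loss, and the tiny 2→3→1 shapes are illustrative choices):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(W1, b1, W2, b2, x, y):
    """Two-layer sigmoid net, MSE loss; returns L and analytic gradients."""
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)    # hidden layer
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)    # output layer
    L = 0.5 * np.sum((a2 - y) ** 2)
    d2 = (a2 - y) * a2 * (1 - a2)          # delta^(L) = (a - y) * sigma'(z)
    d1 = (W2.T @ d2) * a1 * (1 - a1)       # delta^(l) = (W^T delta) * sigma'(z)
    return L, {"W2": np.outer(d2, a1), "b2": d2,
               "W1": np.outer(d1, x),  "b1": d1}

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x, y = np.array([0.5, -0.2]), np.array([1.0])

L, g = loss_and_grads(W1, b1, W2, b2, x, y)
# Numerical check of dL/dW1[0,0] via central differences
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
num = (loss_and_grads(Wp, b1, W2, b2, x, y)[0]
       - loss_and_grads(Wm, b1, W2, b2, x, y)[0]) / (2 * eps)
```

Gradient checking of this kind is the standard way to debug a hand-written backward pass.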

Vanishing and Exploding Gradients

Vanishing Gradient Problem

Phenomenon: when training deep networks, the gradients of early layers approach 0 and their parameters barely update

Mathematical analysis:

Consider an $L$-layer network; the weight gradient at layer 1 is:

$$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)} \left(\mathbf{a}^{(0)}\right)^T$$

where $\boldsymbol{\delta}^{(1)}$ propagates from the output layer via the chain rule:

$$\boldsymbol{\delta}^{(1)} = \left(\prod_{l=1}^{L-1} \mathrm{diag}\!\left(\sigma'(\mathbf{z}^{(l)})\right) \mathbf{W}^{(l+1)T}\right) \boldsymbol{\delta}^{(L)}$$

Key observation:

Sigmoid derivative: $\sigma'(z) \le \frac{1}{4}$. If the weight matrices have spectral norm $\|\mathbf{W}^{(l)}\|_2 \le 1$, then:

$$\|\boldsymbol{\delta}^{(1)}\| \le \left(\frac{1}{4}\right)^{L-1} \|\boldsymbol{\delta}^{(L)}\|$$

Exponential decay! For $L = 10$, $(1/4)^9 \approx 3.8 \times 10^{-6}$.
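The decay can be observed empirically by backpropagating an error vector through random sigmoid layers whose weight matrices are rescaled to spectral norm 1 (layer width, depth, and the random seed are arbitrary choices for this demo):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
L, n = 10, 50

# Forward pass through L-1 sigmoid layers, caching z and W
a = rng.normal(size=n)
zs, Ws = [], []
for _ in range(L - 1):
    W = rng.normal(size=(n, n))
    W /= np.linalg.norm(W, 2)      # rescale so the spectral norm is exactly 1
    z = W @ a
    a = sigmoid(z)
    zs.append(z); Ws.append(W)

# Backpropagate a unit-norm error and track how much its norm shrinks
delta = np.ones(n) / np.sqrt(n)
for W, z in zip(reversed(Ws), reversed(zs)):
    delta = (W.T @ delta) * sigmoid(z) * (1 - sigmoid(z))

ratio = np.linalg.norm(delta)      # bounded above by (1/4)^(L-1) ≈ 3.8e-6
```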

Exploding Gradient Problem

Phenomenon: Gradients grow exponentially, parameter updates become huge, causing numerical overflow

Condition: weight matrix spectral norms $\|\mathbf{W}^{(l)}\|_2 > 1$ combined with activation derivatives $\sigma' \approx 1$ — the backpropagated product then grows exponentially with depth

Solutions for Vanishing Gradients

1. Use ReLU activation: gradient = 1 in the positive region, so the activation derivative contributes no exponential decay

2. Residual connections (ResNet)

$$\mathbf{a}^{(l+1)} = \mathbf{a}^{(l)} + F(\mathbf{a}^{(l)})$$

Gradients can propagate directly through the identity mapping:

$$\frac{\partial \mathbf{a}^{(l+1)}}{\partial \mathbf{a}^{(l)}} = \mathbf{I} + \frac{\partial F}{\partial \mathbf{a}^{(l)}}$$

3. Batch Normalization

Normalizes activation values, stabilizes gradients

4. Use LSTM/GRU (RNN-specific)

Gating mechanisms control information flow

Solutions for Exploding Gradients

1. Gradient Clipping

If $\|\mathbf{g}\| > \theta$, rescale:

$$\mathbf{g} \leftarrow \frac{\theta}{\|\mathbf{g}\|}\, \mathbf{g}$$

2. Weight regularization

Add an L2 penalty $\lambda \|\mathbf{W}\|_2^2$ to the loss to limit weight magnitudes

3. Proper initialization (see next section)
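Of the three fixes, norm-based gradient clipping is the simplest to implement; a minimal sketch (the threshold value below is an arbitrary choice):

```python
import numpy as np

def clip_gradient(g, theta):
    """Rescale g to norm theta if ||g|| exceeds theta; direction is preserved."""
    norm = np.linalg.norm(g)
    if norm > theta:
        return (theta / norm) * g
    return g

g = np.array([30.0, 40.0])             # ||g|| = 50
clipped = clip_gradient(g, theta=5.0)  # rescaled to norm 5, same direction
```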

Weight Initialization Strategies

Why Does Initialization Matter?

Problem 1: Zero initialization

All neurons in a layer compute the same output and receive the same gradient; this symmetry is never broken, so they cannot learn different features

Problem 2: Too large random initialization

Activation values saturate, gradients vanish

Problem 3: Too small random initialization

Activation values near 0, information lost

Variance Preservation Principle

Objective: Preserve variance of activations and gradients during forward and backward propagation

Xavier Initialization

Derivation (Glorot & Bengio, 2010):

For layer $l$'s linear output (ignoring the bias):

$$z_i = \sum_{j=1}^{n_{\text{in}}} w_{ij} x_j$$

Assuming $E[w_{ij}] = 0$, $E[x_j] = 0$, with the $w_{ij}$ and $x_j$ i.i.d. and mutually independent.

Forward propagation:

$$\mathrm{Var}(z_i) = n_{\text{in}}\, \mathrm{Var}(w)\, \mathrm{Var}(x)$$

To require $\mathrm{Var}(z) = \mathrm{Var}(x)$ (variance unchanged):

$$\mathrm{Var}(w) = \frac{1}{n_{\text{in}}}$$

The backward pass analogously requires $\mathrm{Var}(w) = \frac{1}{n_{\text{out}}}$. Compromise (forward and backward):

$$\mathrm{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

Xavier initialization:

$$w \sim \mathcal{U}\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right] \quad \text{or} \quad w \sim \mathcal{N}\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

Suitable for: Sigmoid, Tanh activation functions

He Initialization

Derivation (He et al., 2015):

For ReLU, roughly half of the inputs are zeroed out, so the variance is halved:

$$\mathrm{Var}(z_i) = \frac{1}{2}\, n_{\text{in}}\, \mathrm{Var}(w)\, \mathrm{Var}(x)$$

To require variance unchanged:

$$\mathrm{Var}(w) = \frac{2}{n_{\text{in}}}$$

He initialization:

$$w \sim \mathcal{N}\left(0,\ \frac{2}{n_{\text{in}}}\right)$$

Suitable for: ReLU and its variants
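The effect of the initialization scale can be seen by pushing a random batch through a deep ReLU stack and comparing He initialization against a deliberately-too-small variance (widths, depth, batch size, and the comparison scale are choices for this demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, batch = 256, 20, 200

def run(std_fn, act):
    """Push a batch through L layers; return the std of the final activations."""
    a = rng.normal(size=(batch, n))
    for _ in range(L):
        W = rng.normal(0, std_fn(n), size=(n, n))
        a = act(a @ W)
    return a.std()

relu = lambda z: np.maximum(0.0, z)
he_std = run(lambda n_in: np.sqrt(2.0 / n_in), relu)     # He: Var(w) = 2/n_in
small_std = run(lambda n_in: np.sqrt(0.5 / n_in), relu)  # 4x too small a variance
```

With He scaling the activation magnitude stays of order 1 after 20 layers; with a quarter of that variance the signal shrinks by roughly a factor of 4 per layer and all but dies.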

Initialization Summary

Activation    | Initialization | Variance
------------- | -------------- | --------
Sigmoid/Tanh  | Xavier         | $2/(n_{\text{in}} + n_{\text{out}})$
ReLU          | He             | $2/n_{\text{in}}$
Leaky ReLU    | He (modified)  | $2/\big((1 + \alpha^2)\, n_{\text{in}}\big)$

Q&A

Q1: Why are activation functions needed?

A: Without activation functions, multilayer linear transformations collapse into a single linear transformation: $\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x}$ — still linear, so the network cannot learn nonlinear functions! Activation functions introduce nonlinearity, enabling networks to approximate arbitrarily complex functions.

Q2: Why is ReLU more popular than Sigmoid?

A: Three major advantages: 1. Mitigates vanishing gradients: Gradient = 1 in positive region 2. Computationally efficient: Only requires comparison and max 3. Sparse activation: About 50% neurons output 0, similar to biological neurons

But note Dead ReLU problem: Once in negative region, permanently outputs 0.

Q3: Why are deep networks more powerful than shallow networks?

A: Theoretical and practical reasons: 1. Expressiveness: deep networks can represent some functions with exponentially fewer parameters. Example: computing the parity (XOR) of $n$ inputs — a network of depth $O(\log n)$ needs only $O(n)$ neurons, while a shallow network needs on the order of $2^n$ neurons. 2. Feature hierarchy: lower layers learn simple features (edges), higher layers learn complex features (objects). 3. Optimization landscape: deep network loss functions, though non-convex, tend to have better-quality local optima.

Q4: Why is Batch Normalization effective?

A: Three effects: 1. Reduces internal covariate shift: Stabilizes input distribution at each layer, accelerates convergence 2. Regularization effect: Mini-batch statistics introduce noise, similar to Dropout 3. Allows larger learning rates: More stable gradients

Mathematically, normalization makes loss function Lipschitz constant smaller, optimization smoother.

Q5: How does Dropout prevent overfitting?

A: Two interpretations: 1. Ensemble learning: randomly dropping neurons during training amounts to training $2^n$ sub-networks ($n$ is the neuron count), with prediction approximating their average. 2. Regularization: it forces the network not to rely on any single neuron, learning more robust features.

Mathematically, Dropout approximates L2 regularization (on expected weight values).
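The train/test asymmetry can be sketched as standard (non-inverted) dropout, where test-time activations are scaled by the keep probability so expectations match (array sizes and the drop rate are arbitrary for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p_drop=0.5):
    """Training: zero each unit independently with probability p_drop."""
    mask = (rng.random(a.shape) >= p_drop).astype(a.dtype)
    return a * mask

def dropout_test(a, p_drop=0.5):
    """Test: keep all units, scale by the keep probability to match E[output]."""
    return a * (1.0 - p_drop)

a = np.ones(100_000)
train_mean = dropout_train(a).mean()   # close to 0.5 on average
test_out = dropout_test(a)             # exactly 0.5 everywhere
```

Most frameworks instead use "inverted" dropout, dividing by the keep probability at training time so that test-time code needs no scaling; the expected values are the same.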

✏️ Exercises and Solutions

Exercise 1: Backpropagation

Problem: Let $\hat{y} = \sigma(wx + b)$ with $L = \frac{1}{2}(\hat{y} - y)^2$. Find $\frac{\partial L}{\partial w}$. Solution: Chain rule: with $z = wx + b$, $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = (\hat{y} - y)\, \sigma'(z)\, x$.

Exercise 2: Vanishing Gradient

Problem: Why does sigmoid cause vanishing gradients? Solution: $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \le 1/4$; backpropagation multiplies one such factor per layer, causing exponential decay with depth. ReLU (gradient 1 in the positive region) mitigates this.

Exercise 3: Batch Normalization

Problem: How does BatchNorm accelerate training? Solution: It normalizes each layer's inputs, reducing internal covariate shift and stabilizing gradients, which allows larger learning rates.

Exercise 4: Dropout

Problem: Training uses dropout probability 0.5; what happens at test time? Solution: Keep all neurons and multiply the outgoing weights by 0.5 to maintain the expected output.

Exercise 5: Xavier Initialization

Problem: Why does Xavier use $\mathrm{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$? Solution: It keeps activation variance stable in the forward pass and gradient variance stable in the backward pass (compromising between $1/n_{\text{in}}$ and $1/n_{\text{out}}$), preventing gradient explosion/vanishing.


References

[1] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.

[2] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.

[3] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303-314.

[4] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS, 249-256.

[5] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 1026-1034.

[6] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 448-456.

[7] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Chapter 6: Deep Feedforward Networks)

  • Post title: Machine Learning Mathematical Derivations (19): Neural Networks and Backpropagation
  • Post author: Chen Kai
  • Create time: 2021-12-11 10:45:00
  • Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-19-Neural-Networks-and-Backpropagation/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.