Neural networks are the cornerstone of deep learning: inspired by biological neurons, they stack nonlinear transformations into multiple layers and achieve end-to-end learning through the backpropagation algorithm. From the perceptron convergence theorem to the universal approximation theorem, from vanishing gradients to He initialization, from Sigmoid to ReLU, the mathematics of neural networks provides a solid foundation for understanding deep models. This chapter derives the matrix form of forward propagation, the chain rule behind backpropagation, the analysis of vanishing/exploding gradients, and weight initialization strategies.
Perceptron: The Starting Point of Neural Networks
Perceptron Model
Input: $x \in \mathbb{R}^d$; output: $y \in \{+1, -1\}$; model: $f(x) = \operatorname{sign}(w^\top x + b)$

Activation function (step function):
$$\operatorname{sign}(z) = \begin{cases} +1, & z \ge 0 \\ -1, & z < 0 \end{cases}$$

Geometric interpretation: the hyperplane $w^\top x + b = 0$ divides $\mathbb{R}^d$ into two half-spaces, one per class.
Perceptron Learning Algorithm
Training data: $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where $y_i \in \{+1, -1\}$.

Update rule (stochastic gradient descent): select one misclassified point $(x_i, y_i)$, i.e. one with $y_i(w^\top x_i + b) \le 0$, and update
$$w \leftarrow w + \eta\, y_i x_i, \qquad b \leftarrow b + \eta\, y_i$$
where $\eta > 0$ is the learning rate. Repeat until no point is misclassified.
Perceptron Convergence Theorem
Theorem (Novikoff, 1962): If the data is linearly separable (there exist $w^*$ with $\|w^*\| = 1$ and a margin $\gamma > 0$ such that $y_i\, w^{*\top} x_i \ge \gamma$ for all $i$) and $\|x_i\| \le R$ for all $i$, then the perceptron algorithm converges.

Conclusion: the algorithm makes at most $\left(\frac{R}{\gamma}\right)^2$ updates before finding a separating hyperplane.
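The update rule above can be sketched in a few lines of pure Python; the 2-D dataset below is a made-up separable example, not from the text:

```python
# Toy linearly separable data: label +1 iff x1 + x2 is large (illustrative)
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1),
        ((1.0, 1.0), +1), ((2.0, 1.0), +1), ((1.0, 2.0), +1)]

w, b, eta = [0.0, 0.0], 0.0, 1.0

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Cycle over the data until no point is misclassified; Novikoff's theorem
# guarantees termination because the data is linearly separable.
changed = True
while changed:
    changed = False
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:   # misclassified point
            w[0] += eta * y * x[0]                     # w <- w + eta * y * x
            w[1] += eta * y * x[1]
            b += eta * y                               # b <- b + eta * y
            changed = True
```

After the loop exits, every training point satisfies $y_i(w^\top x_i + b) > 0$.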
Limitations of Perceptron
XOR Problem:
Data: $(0,0) \mapsto 0$, $(0,1) \mapsto 1$, $(1,0) \mapsto 1$, $(1,1) \mapsto 0$. No single hyperplane can separate the two classes, so a perceptron cannot represent XOR.
Solution: Multilayer Perceptron (introduce hidden layers)
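One hidden layer already fixes XOR. A minimal sketch with hand-picked weights (step activations; the OR/AND decomposition is a standard construction, not taken from the text):

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: h1 fires on OR(x1, x2), h2 fires on AND(x1, x2)
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    # Output: OR and not AND, i.e. exactly one input is 1
    return step(h1 - h2 - 0.5)
```

The hidden layer maps the four XOR points into a space where they become linearly separable.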
Multilayer Perceptron and Forward Propagation
MLP Architecture

Layer structure:
- Input layer: $x \in \mathbb{R}^d$ ($d$ features)
- Hidden layer 1: $h^{(1)} = \sigma(W^{(1)} x + b^{(1)}) \in \mathbb{R}^{d_1}$
- Hidden layer 2: $h^{(2)} = \sigma(W^{(2)} h^{(1)} + b^{(2)}) \in \mathbb{R}^{d_2}$
- Output layer: $\hat{y} = f(W^{(3)} h^{(2)} + b^{(3)}) \in \mathbb{R}^{k}$
Forward Propagation Derivation
Computation at layer $l$ (with $a^{(0)} = x$):

Linear transformation:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$

where:
- $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix and $b^{(l)} \in \mathbb{R}^{d_l}$ is the bias vector

Nonlinear activation:
$$a^{(l)} = \sigma(z^{(l)})$$
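The recursion above is a few lines of numpy; the layer sizes here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: d = 4 inputs, two hidden layers, k = 3 outputs
sizes = [4, 8, 6, 3]
params = [(rng.normal(size=(m, n)), np.zeros((m, 1)))   # (W^(l), b^(l))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    a = x                                # a^(0) = x
    for W, b in params:
        z = W @ a + b                    # linear: z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigmoid(z)                   # nonlinear: a^(l) = sigma(z^(l))
    return a

y_hat = forward(rng.normal(size=(4, 1)))
```

Each column of the input can be one sample, so the same code handles mini-batches via matrix multiplication.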
Activation Functions

1. Sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Derivative:
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$

Properties:
- Output range $(0, 1)$
- Interpretable as probability
- Problem: Vanishing gradients ($\sigma'(z) \le \frac{1}{4}$, and $\sigma'(z) \to 0$ for large $|z|$)
2. Tanh function:
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Properties:
- Output range $(-1, 1)$
- Zero-centered (better than Sigmoid)
- Problem: Still has vanishing gradients ($\tanh'(z) = 1 - \tanh^2(z) \to 0$ for large $|z|$)
3. ReLU (Rectified Linear Unit):
$$\operatorname{ReLU}(z) = \max(0, z)$$

Properties:
- Simple computation
- Mitigates vanishing gradients (gradient = 1 in the positive region)
- Problem: Dead ReLU (a neuron whose input stays negative outputs 0 with zero gradient, so it never recovers)
4. Leaky ReLU:
$$\operatorname{LeakyReLU}(z) = \max(\alpha z, z)$$

Usually $\alpha = 0.01$, so negative inputs keep a small nonzero gradient.
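The four activations and the sigmoid derivative bound can be checked directly:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    # sigma'(z) = sigma(z)(1 - sigma(z)), maximized at z = 0 with value 1/4
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)
```

The $\sigma'(z) \le \frac{1}{4}$ bound is exactly what drives the vanishing-gradient analysis later in the chapter.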
Universal Approximation Theorem
Theorem (Cybenko, 1989; Hornik, 1991):
Given any continuous function $f$ on a compact set $K \subset \mathbb{R}^d$ and any $\epsilon > 0$, there exists a single-hidden-layer network
$$g(x) = \sum_{j=1}^{m} v_j\, \sigma(w_j^\top x + c_j)$$
such that $|f(x) - g(x)| < \epsilon$ for all $x \in K$.

But note:
- No guarantee of efficient learning (sample complexity, training time)
- $m$ (the number of hidden units) may be very large
- Deep networks are more efficient in practice than wide networks
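The constructive intuition can be sketched numerically: a pair of steep sigmoids forms a "bump" that approximates the indicator of an interval, and a weighted sum of bumps approximates a continuous target. The target $\sin(2\pi x)$ and all constants below are arbitrary illustrative choices:

```python
import math

def sigmoid(z):
    # numerically safe sigmoid (avoids overflow for large negative z)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def approx(f, x, n=50, k=500.0):
    """Approximate f on [0, 1] by n sigmoid 'bumps':
    sigma(k(x-a)) - sigma(k(x-b)) ~ indicator of the bin [a, b]."""
    width = 1.0 / n
    total = 0.0
    for j in range(n):
        a, b = j * width, (j + 1) * width
        total += f(a + width / 2) * (sigmoid(k * (x - a)) - sigmoid(k * (x - b)))
    return total

f = lambda x: math.sin(2 * math.pi * x)
err = max(abs(f(i / 200) - approx(f, i / 200)) for i in range(201))
```

Shrinking the bin width (larger $n$, steeper $k$) drives the error toward 0, at the cost of $m = 2n$ hidden units — illustrating why $m$ may need to be very large.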
Backpropagation: The Art of the Chain Rule
Loss Functions
Regression task (mean squared error):
$$L = \frac{1}{2} \|\hat{y} - y\|^2$$

Classification task (cross-entropy):
$$L = -\sum_{k} y_k \log \hat{y}_k$$

where $y$ is the one-hot true label and $\hat{y} = \operatorname{softmax}(z^{(L)})$ is the predicted probability vector.
Backpropagation Derivation (Output Layer)

Objective: Compute $\frac{\partial L}{\partial W^{(l)}}$ and $\frac{\partial L}{\partial b^{(l)}}$ for every layer. Define the error at layer $l$ as $\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$.

Output layer ($l = L$):

For mean squared error:
$$\delta^{(L)} = (\hat{y} - y) \odot \sigma'(z^{(L)})$$

where $\odot$ denotes the element-wise (Hadamard) product.

For Softmax + Cross-entropy (special simplification):
$$\delta^{(L)} = \hat{y} - y$$
Backpropagation Derivation (Hidden Layers)
Recursion relation:
$$\delta^{(l)} = \left(W^{(l+1)\top} \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})$$

Physical meaning: the error at layer $l$ is the error at layer $l+1$ propagated backward through the weights, then scaled by the local activation derivative.
Weight Gradient Computation
Weight gradient:
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^\top$$

Bias gradient:
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$
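These formulas can be verified against a numerical finite-difference gradient. A minimal sketch for a two-layer sigmoid network with MSE loss (shapes and sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(3, 1))                  # input a^(0)
y = rng.normal(size=(2, 1))                  # regression target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(W1, b1, W2, b2):
    *_, a2 = forward(W1, b1, W2, b2)
    return 0.5 * float(np.sum((a2 - y) ** 2))

# Backpropagation following the derivation above
z1, a1, z2, a2 = forward(W1, b1, W2, b2)
d2 = (a2 - y) * a2 * (1 - a2)                # delta^(2) = (yhat - y) ⊙ sigma'(z2)
d1 = (W2.T @ d2) * a1 * (1 - a1)             # delta^(1) = (W2ᵀ delta^(2)) ⊙ sigma'(z1)
gW1, gW2 = d1 @ x.T, d2 @ a1.T               # dL/dW^(l) = delta^(l) a^(l-1)ᵀ

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (loss(W1p, b1, W2, b2) - loss(W1, b1, W2, b2)) / eps
```

The analytic and numerical gradients agree to within discretization error, which is the standard sanity check for a backprop implementation.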
Vanishing and Exploding Gradients
Vanishing Gradient Problem

Phenomenon: When training deep networks, gradients of early layers approach 0, parameters barely update
Mathematical analysis:

Consider how the error propagates back from the output layer to layer 1. Unrolling the recursion $\delta^{(l)} = \left(W^{(l+1)\top} \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})$ gives a product of one factor per layer, so the gradient norm is bounded by
$$\|\delta^{(1)}\| \le \left(\prod_{l=2}^{L} \|W^{(l)}\| \cdot \max_z \sigma'(z)\right) \|\delta^{(L)}\|$$

Key observation:

Sigmoid derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \le \frac{1}{4}$

Exponential decay! If the weight norms stay near 1, each layer shrinks the gradient by a factor of up to $\frac{1}{4}$; for $L = 10$ layers that is roughly $\left(\frac{1}{4}\right)^{9} \approx 4 \times 10^{-6}$, so the early layers barely update.
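The decay is easy to observe numerically by multiplying the per-layer Jacobian factors backward through a random sigmoid network (depth, width, and weight scale below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, L = 10, 10
a = rng.normal(size=d)
layers = []
for _ in range(L):                       # forward pass, caching z^(l)
    W = rng.normal(size=(d, d)) * 0.5    # modest weight scale
    z = W @ a
    layers.append((W, z))
    a = sigmoid(z)

grad = np.ones(d)                        # pretend dL/da^(L) is all ones
norms = []
for W, z in reversed(layers):            # backward: delta <- Wᵀ(delta ⊙ sigma'(z))
    grad = W.T @ (grad * sigmoid(z) * (1 - sigmoid(z)))
    norms.append(float(np.linalg.norm(grad)))
```

`norms` shrinks roughly geometrically: the gradient reaching the first layer is orders of magnitude smaller than the one leaving the last layer.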
Exploding Gradient Problem
Phenomenon: Gradients grow exponentially, parameter updates become huge, causing numerical overflow
Condition: when the weight matrices' largest singular values (eigenvalue magnitudes) exceed 1, the repeated products in the backward pass make the gradient norm grow exponentially with depth.
Solutions for Vanishing Gradients
1. Use ReLU activation
2. Residual connections (ResNet)
Gradients can directly propagate through the identity mapping: for $h = x + F(x)$,
$$\frac{\partial h}{\partial x} = I + \frac{\partial F}{\partial x}$$
so the Jacobian always contains an identity term and cannot vanish.
3. Batch Normalization
Normalizes activation values, stabilizes gradients
4. Use LSTM/GRU (RNN-specific)
Gating mechanisms control information flow
Solutions for Exploding Gradients
1. Gradient Clipping: if $\|g\| > \theta$, rescale $g \leftarrow \theta \frac{g}{\|g\|}$
2. Weight regularization
Add an L2 penalty $\frac{\lambda}{2}\|W\|^2$ to the loss, discouraging large weights
3. Proper initialization (see next section)
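Gradient clipping, the simplest of the three, is a one-liner. A sketch of global-norm clipping over a flat gradient vector:

```python
import math

def clip_gradient(grad, threshold):
    """Global-norm gradient clipping: rescale the whole gradient vector
    if its L2 norm exceeds `threshold`, preserving its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)
```

Because the whole vector is rescaled by one scalar, the update direction is unchanged; only the step size is capped.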
Weight Initialization Strategies
Why Initialization Matters?
Problem 1: Zero initialization
All neurons in a layer compute identical outputs and receive identical gradients; the symmetry is never broken, so they cannot learn different features
Problem 2: Too large random initialization
Activation values saturate, gradients vanish
Problem 3: Too small random initialization
Activation values near 0, information lost
Variance Preservation Principle
Objective: Preserve variance of activations and gradients during forward and backward propagation
Xavier Initialization
Derivation (Glorot & Bengio, 2010):
For a layer with fan-in $n_{\text{in}}$ and fan-out $n_{\text{out}}$, consider one pre-activation $z = \sum_{i=1}^{n_{\text{in}}} w_i x_i$.

Assuming the $w_i$ and $x_i$ are independent with zero mean:
$$\operatorname{Var}(z) = n_{\text{in}} \operatorname{Var}(w) \operatorname{Var}(x)$$

Forward propagation: to require $\operatorname{Var}(z) = \operatorname{Var}(x)$, we need $\operatorname{Var}(w) = \frac{1}{n_{\text{in}}}$; the analogous backward condition gives $\operatorname{Var}(w) = \frac{1}{n_{\text{out}}}$.

Compromise (forward and backward):
$$\operatorname{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

Xavier initialization: $w \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$, or the uniform variant $w \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$
Suitable for: Sigmoid, Tanh activation functions
He Initialization
Derivation (He et al., 2015):
For ReLU, roughly half of the pre-activations are zeroed out, so the variance is halved:
$$\operatorname{Var}(z^{(l)}) = \frac{1}{2}\, n_{\text{in}} \operatorname{Var}(w) \operatorname{Var}(x)$$

To require the variance unchanged:
$$\operatorname{Var}(w) = \frac{2}{n_{\text{in}}}$$

He initialization: $w \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)$
Suitable for: ReLU and its variants
Initialization Summary
| Activation | Initialization | Variance |
|---|---|---|
| Sigmoid/Tanh | Xavier | $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ |
| ReLU | He | $\frac{2}{n_{\text{in}}}$ |
| Leaky ReLU | He (modified) | $\frac{2}{(1 + \alpha^2)\, n_{\text{in}}}$ |
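The variance-preservation claims can be checked empirically; the layer width and batch size below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 2000
x = rng.normal(size=(n_in, batch))            # unit-variance input

# Xavier: Var(w) = 2/(n_in + n_out) keeps the linear pre-activation
# variance near 1 (exactly 1 in expectation when n_in == n_out)
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))
z = W_xavier @ x

# He: Var(w) = 2/n_in keeps the second moment of ReLU outputs near 1
# (ReLU zeroes half the mass, and the 2 in the numerator compensates)
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
a = np.maximum(0.0, W_he @ x)
```

With enough samples, `z.var()` and `(a**2).mean()` both land very close to 1, so signal magnitude neither explodes nor collapses as it passes through the layer.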
Q&A
Q1: Why are activation functions needed?
A: Without activation functions, the composition of linear transformations is still linear:
$$W^{(2)}\left(W^{(1)} x + b^{(1)}\right) + b^{(2)} = \left(W^{(2)} W^{(1)}\right) x + \left(W^{(2)} b^{(1)} + b^{(2)}\right)$$
so no matter how many layers are stacked, the network can only represent a linear map and cannot fit nonlinear functions such as XOR.
Q2: Why is ReLU more popular than Sigmoid?
A: Three major advantages:
1. Mitigates vanishing gradients: gradient = 1 in the positive region
2. Computationally efficient: only requires a comparison ($\max(0, z)$), no exponentials
3. Sparse activation: about 50% of neurons output 0, loosely similar to biological neurons
But note the Dead ReLU problem: once a neuron's input stays in the negative region, it permanently outputs 0 and stops learning.
Q3: Why are deep networks more powerful than shallow networks?
A: Theoretical and practical reasons:
1. Expressiveness: deep networks can represent certain functions with exponentially fewer parameters than shallow ones. Example: computing the parity of $d$ input bits takes $O(d)$ units arranged in depth but exponentially many units in a single hidden layer.
Q4: Why is Batch Normalization effective?
A: Three effects:
1. Reduces internal covariate shift: stabilizes the input distribution at each layer, accelerating convergence
2. Regularization effect: mini-batch statistics introduce noise, similar to Dropout
3. Allows larger learning rates: gradients are more stable
Mathematically, normalization makes the loss landscape smoother (a smaller effective Lipschitz constant), which eases optimization.
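The core normalization step (without the running statistics used at inference time) is a short sketch:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch (axis 0) to zero mean and
    unit variance, then apply the learnable scale/shift gamma, beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))  # shifted, scaled batch
y = batch_norm(x)
```

Whatever the incoming mean and scale, the layer's input distribution is re-standardized, which is exactly effect 1 above.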
Q5: How does Dropout prevent overfitting?
A: Two interpretations:
1. Ensemble learning: randomly dropping neurons during training amounts to training an exponential number ($2^n$ for $n$ droppable neurons) of weight-sharing sub-networks; prediction averages over this implicit ensemble.
2. Regularization: mathematically, Dropout approximates an L2 penalty on the (expected) weights and discourages co-adaptation between neurons.
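A sketch of the common "inverted" dropout variant, which scales kept activations at training time so that no weight rescaling is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_drop, train=True):
    """Inverted dropout: at training time, zero each unit with
    probability p_drop and scale survivors by 1/(1 - p_drop) so the
    expected output matches test time, when all units are kept."""
    if not train:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

x = np.ones(100_000)
train_mean = float(dropout(x, 0.5, train=True).mean())
test_out = dropout(x, 0.5, train=False)
```

The training-time mean matches the test-time output in expectation, which is the invariant that lets the same weights serve both phases.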
✏️ Exercises and Solutions
Exercise 1: Backpropagation
Problem:
Exercise 2: Vanishing Gradient
Problem: Why does the sigmoid cause vanishing gradients?
Solution: $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \le \frac{1}{4}$, and backpropagation multiplies one such factor per layer, so the gradient decays at least geometrically (roughly $(1/4)^L$ over $L$ layers) and early layers barely update.
Exercise 3: Batch Normalization
Problem: How does BatchNorm accelerate training? Solution: Normalizes layer inputs, reduces internal covariate shift, allows larger learning rates.
Exercise 4: Dropout
Problem: Training uses 0.5 dropout, what about testing? Solution: Keep all neurons, multiply weights by 0.5 to maintain expected output.
Exercise 5: Xavier Initialization
Problem: Why does Xavier initialization use $\operatorname{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$? Solution: Preserving activation variance in the forward pass requires $\operatorname{Var}(w) = \frac{1}{n_{\text{in}}}$, while preserving gradient variance in the backward pass requires $\frac{1}{n_{\text{out}}}$; Xavier takes the compromise $\frac{2}{n_{\text{in}} + n_{\text{out}}}$.
References
[1] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.
[2] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
[3] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303-314.
[4] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS, 249-256.
[5] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 1026-1034.
[6] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 448-456.
[7] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Chapter 6: Deep Feedforward Networks)
- Post title: Machine Learning Mathematical Derivations (19): Neural Networks and Backpropagation
- Post author: Chen Kai
- Create time: 2021-12-11 10:45:00
- Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-19-Neural-Networks-and-Backpropagation/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.