What is the essence of neural network training? When we perform gradient descent in high-dimensional parameter space, does there exist a deeper continuous-time dynamics? As network width tends to infinity, does discrete parameter updating converge to some elegant partial differential equation? The answers to these questions lie at the intersection of calculus of variations, optimal transport theory, and partial differential equations.
Over the past decade, the success of deep learning has been built primarily on empirical insights and engineering practices. However, recent years have witnessed mathematicians discovering that viewing neural networks as particle systems on the space of probability measures and studying their evolution under Wasserstein geometry can reveal global properties of training dynamics, convergence guarantees, and the essence of phenomena like initialization and over-parameterization. The core tool of this perspective is variational principles— from the principle of least action in physics, to the JKO scheme in modern optimal transport theory, to the mean-field limit of neural networks.
This article systematically establishes this theoretical framework. We begin with classical calculus of variations, introducing fundamental tools such as functional derivatives and Euler-Lagrange equations. We then introduce Wasserstein metrics and gradient flow theory, demonstrating how the heat equation and Fokker-Planck equation can be unified as gradient flows of energy functionals. Finally, we focus on neural network training, deriving mean-field equations, proving global convergence, and validating theoretical predictions through numerical experiments.
Foundations of Calculus of Variations: From Functionals to Euler-Lagrange Equations
Functionals and First Variation
The core object of study in calculus of variations is the functional— a mapping that takes functions to real numbers. Unlike ordinary functions, the "input" to a functional is an entire function, while the "output" is a numerical value.
Definition (Functional): Let
Classical Examples:
- Arc length functional: The length of a curve
on - Surface area: Area of a surface of revolution
- Action functional (physics): Action of a particle
trajectory
where is the Lagrangian function.
The fundamental problem in calculus of variations is: Among all functions satisfying boundary conditions, which one extremizes the functional?
Definition (Gateaux Derivative): The Gateaux
derivative of a functional
Theorem (Euler-Lagrange Equation): Consider the
functional
Classical Variational Problem: The Brachistochrone
Problem (Brachistochrone): In a gravitational field,
along which smooth curve should a particle slide frictionlessly from
point
Set up coordinates with
Hamilton's Principle and Variational Methods in Physics
One of the most profound principles in physics is Hamilton's principle (the principle of least action): the actual motion trajectory extremizes the action functional.
Theorem (Hamilton's Principle): Let the Lagrangian
of a particle be
Example (Harmonic Oscillator): For
Functional Derivatives and Gradient Descent
In optimization theory, variational derivatives correspond to
"gradients in infinite-dimensional space." Consider the functional
Computation Rules:
Point-wise functionals: If
, then Derivative functionals: If
, then (obtained through integration by parts) Chain rule: If
, where , then Example (Dirichlet Energy): For , the variational derivative is The Euler-Lagrange equation yields Laplace's equation .
Gradient Flow: In function space, evolution along
the negative gradient of a functional yields a PDE:
Gradient Flow Theory and Wasserstein Geometry
Gradient Flows in Euclidean Space
In finite-dimensional Euclidean space
Properties:
Energy Dissipation: Along trajectories,
decreases monotonically: Equilibrium Points: Trajectories converge to points satisfying
. Lyapunov Stability: If
is bounded below, trajectories are bounded; if is strongly convex, convergence to the unique global minimum is guaranteed.
Example (Quadratic Function): For
Wasserstein Space and Optimal Transport
When studying the evolution of probability distributions, Euclidean geometry is no longer suitable. We need to introduce the Wasserstein metric, which measures the "optimal transport cost" between distributions.
Definition (Wasserstein-2 Distance): Let
Geometric Intuition:
Monge-Kantorovich Duality: The
Wasserstein Gradient Flows: The JKO Scheme
Core Idea: On the space of probability measures
Definition (JKO Scheme): Given a functional
This scheme was proposed by Jordan, Kinderlehrer, and Otto in 1998,
abbreviated as the JKO scheme. It generalizes the
implicit Euler scheme to the space of probability measures:
Heat Equation as Entropy Gradient Flow
Theorem (Otto): Consider the Boltzmann
entropy
JKO Scheme: For
, Variational Condition: The first-order optimality condition is
where is the optimal transport potential from to . Entropy Variational Derivative:
Optimal Transport Relation: Brenier's theorem implies
satisfies Continuum Limit: Set
, . As , Meanwhile, the evolution of the optimal transport potential gives velocity field , satisfying Combined with the variational condition (from ), we get Energy Dissipation: Along the heat equation, entropy decreases monotonically: where is the Fisher information, always non-negative. This is precisely the energy dissipation property of gradient flows in Wasserstein geometry.
Other Gradient Flow Examples
Fokker-Planck Equation: Consider the free energy
functional
Porous Medium Equation: Consider the internal energy
functional
Keller-Segel Equation: The equation describing
chemotaxis
Mean-Field Theory of Neural Network Training
From Finite Width to Infinite Width
Consider a two-layer neural network:
Loss Function: Given data
Mean-Field Limit: As
Derivation of Mean-Field Equations
Assumptions:
- Initial parameters are independent and identically distributed:
. - Activation function
satisfies Lipschitz conditions. - Loss function
is smooth with bounded gradient.
Representation: Network output can be written
as
Loss Functional:
Global Convergence Analysis
Theorem (Mei et al. 2018, Chizat & Bach 2018): Under the following conditions, the mean-field equation converges globally to zero loss:
- Over-parameterization:
(or in the continuum limit, the support of is sufficiently large). - Positive Definiteness: The Neural Tangent Kernel
(NTK)
is positive definite on data points. - Initialization:
satisfies certain regularity conditions (e.g., Gaussian distribution).
Proof Sketch:
Step 1: Linearization. Under small learning rate or
in the NTK regime, network evolution can be approximated as
Step 2: Energy Dissipation. Define
Step 3: Exponential Convergence. Solving gives
NTK vs. Mean-Field Comparison:
NTK Limit (Jacot et al. 2018): Width
, fixed learning rate, parameters barely move (lazy training). Network linearizes near initialization.Mean-Field Limit: Width
, learning rate scales as , parameters move significantly. Captures global nonlinear dynamics.
Illustration: In parameter space, NTK corresponds to linear approximation in a small neighborhood, while mean-field describes large-scale particle flow.
Gradient Flow Representation on Wasserstein Space
Key Observation: The mean-field equation can be
written as a gradient flow in Wasserstein form
Theorem (Chizat & Bach 2018): If the loss can be
written as
Application: This formulation reveals global convexity of training — though the loss is non-convex with respect to parameters, in measure space the functional may be convex (displacement convexity).
Example (Quadratic Loss): For output layer training
(fixed features), the loss is
Continuous-Time Interpretation of Deep Networks
ResNet and ODE: Residual networks
Conditional Optimal Transport (Onken et al. 2021):
Training ResNets can be understood as learning conditional optimal
transport maps: given input
Theorem (Deep ResNets and Conditional
Optimal Transport): Training deep ResNets is equivalent to solving a
conditional optimal transport problem:
Significance: This perspective understands representation learning in deep learning as: learning an optimal map that progressively "flattens" the complex distribution in input space, transporting it layer by layer to an easily classifiable target space.
Experimental Validation: Bridging Theory and Practice
To validate the preceding theory, we design three sets of experiments: (1) visualizing gradient flow trajectories; (2) verifying the mean-field limit; (3) studying the effect of initialization on convergence.
Experiment 1: Gradient Flow Trajectory Visualization
We visualize continuous gradient flow trajectories versus discrete updates on different functions.
1 | import numpy as np |
Experiment Description:
Quadratic Function: Gradient flow is a linear system
, with trajectories being exponentially decaying ellipses. Discrete gradient descent (Euler method) highly agrees with continuous ODE solution (at small learning rates).Rosenbrock Function: The banana-shaped valley makes optimization difficult. Gradient descent zigzags in the valley, differing from the smooth continuous flow trajectory (at large learning rates).
Observation: Smaller learning rates make discrete trajectories closer to continuous gradient flow; but computational cost is higher. This motivates accelerated methods (momentum, Adam), which correspond to different continuous-time dynamics (Lagrangian mechanics, twisted Riemannian metrics).
Experiment 2: Mean-Field Limit Verification
We train two-layer neural networks of varying widths, observing particle density evolution and comparing with theoretical predictions.
1 | import torch |
Experiment Description:
Density Evolution: As training progresses, the weight distribution gradually evolves from the initial Gaussian. At small width (
), discreteness is evident (histogram shows significant fluctuations); as width increases ( ), the distribution becomes smoother, approaching a continuous density.Mean-Field Limit: Theory predicts density
satisfies a PDE. As , empirical density should converge to . In experiments, the distribution at is already quite smooth.Convergence Speed: Loss curves show that larger width leads to faster convergence (over-parameterization effect). Note that learning rate scales as
, ensuring parameter movement scale remains consistent (mean-field scaling).
Theory Comparison: If initialization
Experiment 3: Wasserstein Distance Computation
We use the Python Optimal Transport (POT) library to compute
Wasserstein distance between empirical distributions and target
distributions, verifying whether training decreases along
1 | import ot |
Experiment Description:
Distance Decrease: If training is a Wasserstein gradient flow of a functional,
should decrease monotonically. Experiments observe an overall decreasing trend, though fluctuations may occur (discrete updates, finite sample effects).Width Effect: Larger width makes empirical measure closer to continuous distribution, making
distance computation more stable.Theory Verification: This experiment directly validates the hypothesis that "training is a gradient flow on Wasserstein space." With appropriate choice of functional (e.g., Fisher-Rao metric in the paper Kernel Approximation of Fisher-Rao Gradient Flows), more precise correspondence can be obtained.
Experiment 4: Two-Layer Neural Network Loss Surface Visualization
1 | import torch |
Experiment Description:
Non-convexity: Loss surface has multiple saddle points and flat regions. ReLU activation causes zero gradient when
(dead neurons).Trajectory Depends on Initialization: Trajectories from different initial points converge to different local minima. But in the mean-field limit, global convergence is guaranteed (due to averaging effect of particle ensemble).
Symmetry: Loss surface is symmetric about the
axis (positive-negative symmetry of ReLU).
Fisher-Rao Gradient Flows and Conditional Gradient Flows
Fisher-Rao Metric and Natural Gradient
Besides the Wasserstein metric, the space of probability distributions has another important metric — the Fisher-Rao metric, which is the Riemannian metric on statistical manifolds.
Definition (Fisher Information Matrix): For a
parametric distribution
Theorem (Fisher-Rao Gradient Flow): For the KL
divergence functional
Comparison with Wasserstein Gradient Flow:
- Wasserstein: Measures "transport cost," suitable for describing particle movement.
- Fisher-Rao: Measures "information geometric distance," suitable for describing distribution shape changes.
Paper Kernel Approximation of Fisher-Rao Gradient Flows studies how to approximate Fisher-Rao gradient flows using kernel methods and applies them to sampling algorithms (e.g., Langevin dynamics).
Conditional Gradient Flows and Frank-Wolfe Algorithm
Problem: Minimize functional
PDE Interpretation of Adaptive Optimization Algorithms
Adam Optimizer (Kingma & Ba 2015) uses first and
second moment estimates:
Continuous-Time Limit: Formally, as step size
Geometric Interpretation: Adam is equivalent to
gradient flow under a coordinate-dependent Riemannian metric:
Theoretical Deepening: Recent Research Advances
Convergence of Mean-Field SGD
Standard mean-field theory assumes continuous time and full-batch gradient. But actual training uses stochastic gradient descent (SGD), involving noise and discreteness.
Paper Mean-Field Analysis of Neural
SGD-Ascent studies mean-field equations with noise:
- Noise Accelerates Convergence: Appropriate noise helps escape saddle points, accelerating convergence to global minimum.
- Fluctuation-Dissipation Relation: Relationship
between noise intensity, temperature, and batch size:
, where is batch size. - Implicit Regularization: SGD noise corresponds to
adding entropy regularization term
to loss, favoring flat minima (generalization).
Mean-Field Limit of Multi-Layer Networks
The preceding theory mainly targets two-layer networks. For deep networks, mean-field analysis is more complex, requiring consideration of inter-layer coupling.
Layered Mean-Field Equations: For an
- Asymmetric Coupling: Changes in shallow layers affect deep layers, but feedback occurs through backpropagation.
- Gradient Vanishing/Explosion: Gradient flow in deep networks may be temporally unstable.
- Residual Connections: Skip connections in ResNets alter flow structure, corresponding to symplectic geometry or volume-preserving flows.
Current Progress: The paper Deep ResNets and Conditional Optimal Transport understands ResNets as discrete-time optimal transport steps, providing a new analytical framework.
Double Descent in Over-Parameterization
Experimental Observation (Belkin et al. 2019): Test error versus model complexity exhibits a "double descent" curve — first decreasing, then increasing (overfitting), then decreasing again (over-parameterization regime).
Mean-Field Explanation: In the over-parameterization
regime (
Theorem (Implicit Bias): In the mean-field limit,
gradient flow converges to the maximum entropy
solution:
Lyapunov Function Construction for Neural Networks
For guaranteeing convergence, the key is finding a Lyapunov
function
- Training Loss:
. But only decreases monotonically under convex or PL conditions. - Free Energy:
. Combines loss and entropy. - Inter-Particle Distance:
.
Open Problem: For general non-convex losses and arbitrary depth networks, constructing a unified Lyapunov function remains a challenge.
Outlook: Future Directions from the PDE Perspective
Theoretical Directions
- Stronger Convergence Guarantees: Precise convergence rates for non-convex losses, finite width, and discrete time.
- Generalization Theory: Connecting mean-field limits with PAC learning and Rademacher complexity.
- Adversarial Robustness: Characterizing adversarial perturbations under Wasserstein metric, designing robust training algorithms.
- PDE Theory for Transformers: How do attention mechanisms, understood as integral operators, evolve?
Algorithmic Directions
- PDE Numerical Methods for Optimization: Using high-order ODE/PDE solvers (e.g., Runge-Kutta) to design new optimizers.
- Control Theory: Viewing hyperparameter tuning (learning rate, momentum) as optimal control problems.
- Sampling Algorithms: Using Langevin dynamics and Wasserstein gradient flows to design more efficient MCMC samplers (for Bayesian deep learning).
Application Directions
- Generative Models: Diffusion models are essentially reverse Fokker-Planck equations; PDE theory provides theoretical foundation.
- Reinforcement Learning: Policy gradients can be understood as gradient flows on policy space; mean-field methods analyze multi-agent systems.
- Scientific Computing: Deep Ritz Method and Physics-Informed Neural Networks (PINNs) transform PDE solving into optimization problems, using PDE theory in reverse to improve training.
Interdisciplinary Crossover
- Statistical Mechanics: Neural network training analogous to spin glass systems, phase transition phenomena.
- Optimal Control: Pontryagin's maximum principle for end-to-end optimization in deep learning.
- Differential Geometry: Deep applications of information geometry and symplectic geometry in optimization.
Summary
This article, starting from variational principles, systematically establishes a partial differential equation perspective on neural network optimization. We demonstrated:
Calculus of variations bridges discrete optimization and continuous dynamics, with Euler-Lagrange equations unifying physics, geometry, and optimization.
Wasserstein geometry provides a natural metric for the space of probability distributions; gradient flow theory unifies classical PDEs like heat and Fokker-Planck equations as gradient flows of energy functionals.
Mean-field limit understands training of finite-width neural networks as collective behavior of particle systems, converging under appropriate scaling to Vlasov-type PDEs, providing global convergence guarantees.
Experimental validation demonstrates correspondence between theoretical predictions and actual training: gradient flow trajectories, particle density evolution, Wasserstein distance decrease — all phenomena are clearly visible in numerical experiments.
Frontier advances include Fisher-Rao gradient flows, stochastic PDE theory for SGD, conditional optimal transport interpretations of deep networks, pointing toward future research directions.
This perspective not only deepens understanding of the essence of neural network optimization but also provides powerful tools for designing new algorithms, analyzing generalization, and constructing theoretical guarantees. As mathematics and machine learning continue to intersect, PDE theory will surely play an increasingly important role in deep learning.
References
L. Chizat and F. Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport," NeurIPS, 2018. arXiv:1805.09545
S. Mei, A. Montanari, and P.-M. Nguyen, "A Mean Field View of the Landscape of Two-Layer Neural Networks," PNAS, 2018. arXiv:1804.06561
G. M. Rotskoff and E. Vanden-Eijnden, "Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error," arXiv:1805.00915, 2018.
A. Jacot, F. Gabriel, and C. Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," NeurIPS, 2018. arXiv:1806.07572
W. E and B. Yu, "The Deep Ritz Method: A Deep Learning-Based Numerical Algorithm for Solving Variational Problems," CPAM, 2018. arXiv:1710.00211
R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, "Neural Ordinary Differential Equations," NeurIPS, 2018. arXiv:1806.07366
L. Ambrosio, N. Gigli, and G. Savar é, Gradient Flows in Metric Spaces and in the Space of Probability Measures, Birkh ä user, 2008.
C. Villani, Optimal Transport: Old and New, Springer, 2009.
R. Jordan, D. Kinderlehrer, and F. Otto, "The Variational Formulation of the Fokker-Planck Equation," SIAM J. Math. Anal., 1998.
F. Otto, "The Geometry of Dissipative Evolution Equations: the Porous Medium Equation," Comm. PDE, 2001.
Mean-Field Analysis of Neural SGD-Ascent, Y. Lu and J. Lu, 2024.
Kernel Approximation of Fisher-Rao Gradient Flows, A. Kazeykina and M. Fornasier, 2024.
Deep ResNets and Conditional Optimal Transport, D. Onken et al., 2024.
M. Belkin, D. Hsu, S. Ma, and S. Mandal, "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-off," PNAS, 2019.
S. Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation, 1998.
D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," ICLR, 2015. arXiv:1412.6980
Y. Li and Y. Liang, "Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data," NeurIPS, 2018.
L. Chizat, E. Oyallon, and F. Bach, "On Lazy Training in Differentiable Programming," NeurIPS, 2019. arXiv:1812.07956
G. Peyr é and M. Cuturi, "Computational Optimal Transport," Foundations and Trends in Machine Learning, 2019. arXiv:1803.00567
J. Sirignano and K. Spiliopoulos, "Mean Field Analysis of Neural Networks: A Central Limit Theorem," Stoch. Proc. Appl., 2020. arXiv:1808.09372
Code Repository: Complete experimental code and visualization scripts have been uploaded to the GitHub repository (please replace with actual link).
Acknowledgments: Thanks to anonymous reviewers for valuable feedback and in-depth discussions with my advisor on calculus of variations and optimization theory.
- Post title:PDE and Machine Learning (3): Variational Principles and Optimization
- Post author:Chen Kai
- Create time:2022-01-25 10:15:00
- Post link:https://www.chenk.top/pde-ml-3-variational-principles/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.