Probabilistic inference is one of the core problems in machine learning. Given observed data, we wish to infer the posterior distribution of latent variables or sample from complex high-dimensional distributions. Traditional methods fall into two main categories: Variational Inference (VI) approximates the posterior by optimizing a variational lower bound, while Markov Chain Monte Carlo (MCMC) samples by constructing Markov chains. These seemingly different approaches reveal profound unity when viewed through the lens of partial differential equations.
When we use Langevin dynamics for MCMC sampling, particle motion in a potential field is described by stochastic differential equations, and the evolution of the probability density is governed by the Fokker-Planck equation. When we optimize the variational lower bound by gradient descent, the evolution of the approximating distribution in Wasserstein space can likewise be viewed as the gradient flow of an energy functional. More remarkably, the gradient flow that minimizes the KL divergence is itself governed by the Fokker-Planck equation: variational inference and Langevin MCMC are equivalent in the continuous-time limit. This PDE perspective not only reveals the mathematical essence of probabilistic inference but also provides a unified theoretical framework for designing new inference algorithms such as Stein Variational Gradient Descent.
This article systematically establishes this theoretical framework. We begin with the Fokker-Planck equation, showing how to formalize the probability density evolution of stochastic processes as partial differential equations. We then delve into Langevin dynamics, discussing overdamped and underdamped cases, and the distinction between Itô and Stratonovich integrals. Next, we establish the gradient flow interpretation of KL divergence, proving the equivalence between variational inference and Langevin MCMC. Finally, we focus on advanced methods like Stein Variational Gradient Descent, demonstrating how to solve variational inference problems using particle systems, and validate theoretical predictions through four complete experiments.
Introduction: PDE Perspective on Probabilistic Inference
Fundamental Problems in Probabilistic Inference
Bayesian Inference: Given observed data $x$, infer the posterior distribution of latent variables $z$:
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}, \qquad p(x) = \int p(x \mid z)\, p(z)\, dz.$$
The normalizing constant $p(x)$ is typically intractable in high dimensions.

Variational Inference: Approximate the posterior $p(z \mid x)$ using a simple distribution family $\{q_\theta\}$ by minimizing the KL divergence $\mathrm{KL}(q_\theta \,\|\, p(\cdot \mid x))$, or equivalently maximizing the evidence lower bound (ELBO).
Unified Framework from PDE Perspective
From a PDE perspective, both methods involve evolution on the space of probability measures:
Langevin MCMC: Particles $X_t$ follow the stochastic differential equation
$$dX_t = -\nabla V(X_t)\, dt + \sqrt{2\beta^{-1}}\, dW_t,$$
where $V$ is the potential function and $\beta^{-1}$ is the temperature parameter. The probability density evolution is governed by the Fokker-Planck equation:
$$\frac{\partial \rho}{\partial t} = \nabla \cdot (\rho \nabla V) + \beta^{-1} \Delta \rho.$$

Variational Inference: The approximating distribution $q_t$ evolves in Wasserstein space along the gradient flow minimizing the KL divergence. It can be shown that this evolution is also described by the Fokker-Planck equation, differing only in initial and boundary conditions.

Stein Variational Gradient Descent: Approximate distributions using a particle system $\{x_i\}_{i=1}^N$, where each particle follows
$$\dot{x}_i = \frac{1}{N} \sum_{j=1}^{N} \left[ k(x_j, x_i)\, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x_i) \right],$$
where $k(\cdot,\cdot)$ is a kernel function. This can be viewed as a discretization of a kernelized Fokker-Planck equation under a finite-particle approximation.
Article Structure
This article is organized as follows:
Fokker-Planck Equation and Probability Density Evolution: Derive the Fokker-Planck equation from stochastic differential equations, discussing its physical meaning and mathematical properties.
Langevin Dynamics: Detailed discussion of overdamped and underdamped Langevin equations, the distinction between Itô and Stratonovich integrals, and numerical solution methods.
Gradient Flow Interpretation of KL Divergence: Prove that KL divergence minimization is equivalent to solving the Fokker-Planck equation, establishing connections between variational inference and Langevin MCMC.
Equivalence of Variational Inference and Langevin MCMC: Prove the equivalence of both methods in the continuous-time limit.
Stein Variational Gradient Descent: Introduce the SVGD method, demonstrating how to solve variational inference problems using particle systems.
Experimental Validation: Validate theoretical predictions through four experiments.
Fokker-Planck Equation and Probability Density Evolution
From Stochastic Differential Equations to Fokker-Planck Equation
Consider a general stochastic differential equation (SDE):
$$dX_t = b(X_t)\, dt + \sigma(X_t)\, dW_t,$$
where $b$ is the drift coefficient, $\sigma$ the diffusion coefficient, and $W_t$ a standard Wiener process.

The Fokker-Planck equation (also known as the Kolmogorov forward equation) describes the evolution of the probability density function $p(x, t)$ of $X_t$:
$$\frac{\partial p}{\partial t} = -\frac{\partial}{\partial x}\big(b(x)\, p\big) + \frac{1}{2} \frac{\partial^2}{\partial x^2}\big(\sigma^2(x)\, p\big).$$

Vector form: for $X_t \in \mathbb{R}^d$ with diffusion matrix $D = \sigma \sigma^\top$,
$$\frac{\partial p}{\partial t} = -\nabla \cdot (b\, p) + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\big(D_{ij}\, p\big).$$
Derivation: From SDE to Fokker-Planck Equation

Approach: Consider an arbitrary smooth, compactly supported test function $f$. By Itô's lemma,
$$df(X_t) = \left[ b(X_t)\, f'(X_t) + \tfrac{1}{2}\sigma^2(X_t)\, f''(X_t) \right] dt + \sigma(X_t)\, f'(X_t)\, dW_t.$$
Taking expectations kills the martingale term, so
$$\frac{d}{dt} \int f(x)\, p(x,t)\, dx = \int \left[ b f' + \tfrac{1}{2}\sigma^2 f'' \right] p\, dx.$$
Integrating by parts to move the derivatives onto $p$, and using that $f$ is arbitrary, yields the Fokker-Planck equation.
Special Case: Fokker-Planck Equation for Langevin Dynamics

For overdamped Langevin dynamics:
$$dX_t = -\nabla V(X_t)\, dt + \sqrt{2\beta^{-1}}\, dW_t,$$
the corresponding Fokker-Planck equation is:
$$\frac{\partial \rho}{\partial t} = \nabla \cdot (\rho \nabla V) + \beta^{-1} \Delta \rho,$$
whose stationary solution is the Gibbs distribution $\rho_\infty(x) \propto e^{-\beta V(x)}$.
Probability Conservation: If the initial distribution satisfies $\int \rho_0(x)\, dx = 1$, then $\int \rho_t(x)\, dx = 1$ for all $t > 0$.

Proof: The Fokker-Planck equation is in divergence form, $\partial_t \rho = \nabla \cdot J$ with flux $J = \rho \nabla V + \beta^{-1} \nabla \rho$. Hence
$$\frac{d}{dt} \int \rho\, dx = \int \nabla \cdot J\, dx = 0$$
by the divergence theorem, assuming the flux vanishes at infinity (or at reflecting boundaries).

Entropy Increase Principle: For a pure diffusion process ($V = 0$), the entropy $H(\rho) = -\int \rho \log \rho\, dx$ increases over time.

H-Theorem: For Langevin dynamics whose equilibrium distribution is the Gibbs distribution $\rho_\infty \propto e^{-\beta V}$, the relative entropy (KL divergence) decreases monotonically:
$$\frac{d}{dt}\, \mathrm{KL}(\rho_t \,\|\, \rho_\infty) = -\beta^{-1} \int \rho_t \left| \nabla \log \frac{\rho_t}{\rho_\infty} \right|^2 dx \le 0.$$
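The H-theorem can be checked numerically. The following minimal sketch (not from the original experiments; the potential $V(x) = x^2/2$, $\beta = 1$, grid, and step sizes are illustrative choices) integrates the 1-D Fokker-Planck equation with explicit finite differences and verifies that the KL divergence to the Gibbs distribution decreases monotonically:

```python
import numpy as np

# Grid and target: V(x) = x^2/2, beta = 1, so rho_inf is the standard normal
x = np.linspace(-5.0, 5.0, 201)
dx = x[1] - x[0]
V_prime = x                                  # V'(x) = x
rho_inf = np.exp(-0.5 * x**2)
rho_inf /= rho_inf.sum() * dx

# Initial condition: a narrow Gaussian far from equilibrium
rho = np.exp(-0.5 * (x - 2.0)**2 / 0.09)
rho /= rho.sum() * dx

def kl(r):
    """Discrete KL(rho || rho_inf), ignoring (near-)zero-density cells."""
    mask = r > 1e-12
    return np.sum(r[mask] * np.log(r[mask] / rho_inf[mask])) * dx

# Explicit Euler on  d rho/dt = d/dx (rho V') + d^2 rho/dx^2
dt = 1e-4
kls = [kl(rho)]
for step in range(1, 5001):
    drift = np.gradient(rho * V_prime, dx)              # d/dx (rho V')
    diffusion = np.gradient(np.gradient(rho, dx), dx)   # d^2 rho/dx^2
    rho = rho + dt * (drift + diffusion)
    if step % 500 == 0:
        kls.append(kl(rho))

# H-theorem: KL divergence decreases monotonically along the flow
assert all(b <= a + 1e-9 for a, b in zip(kls, kls[1:]))
print([round(v, 3) for v in kls])
```

The recorded KL values shrink toward zero as the density relaxes to the Gibbs distribution, matching the dissipation identity above.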
Langevin Dynamics: Overdamped and Underdamped
Overdamped Langevin Equation
Physical Background: In a viscous medium, particle inertia can be neglected, and motion is entirely driven by friction and random forces.
Overdamped Langevin Equation:
$$dX_t = -\nabla V(X_t)\, dt + \sqrt{2\beta^{-1}}\, dW_t.$$
The simplest numerical scheme is the Euler-Maruyama discretization (the unadjusted Langevin algorithm, ULA):
$$x_{k+1} = x_k - \epsilon\, \nabla V(x_k) + \sqrt{2\epsilon \beta^{-1}}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),$$
which has an $O(\epsilon)$ bias in its stationary distribution.

Improved Method: The Metropolis-adjusted Langevin algorithm (MALA) adds a Metropolis accept-reject step after each iteration, removing the discretization bias so that the chain targets the exact stationary distribution.
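As a concrete illustration (a minimal sketch, not the article's own code; the target $V(x) = x^2/2$, step size, and chain counts are arbitrary choices), the following compares the unadjusted Langevin algorithm (plain Euler-Maruyama) with MALA on a standard Gaussian target. The Metropolis step removes the $O(\epsilon)$ bias visible in the ULA stationary variance:

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 0.1                       # step size epsilon

def grad_V(x):                  # V(x) = x^2/2, so the target is N(0, 1)
    return x

def log_p(x):                   # unnormalized log target density
    return -0.5 * x**2

def log_q(y, x):
    """Log density of the Langevin proposal N(x - eps*grad_V(x), 2*eps)."""
    return -((y - (x - EPS * grad_V(x)))**2) / (4 * EPS)

def sample(adjust, n_chains=500, n_steps=2000, burn_in=500):
    x = 3.0 * rng.standard_normal(n_chains)     # overdispersed start
    out = []
    for step in range(n_steps):
        prop = x - EPS * grad_V(x) + np.sqrt(2 * EPS) * rng.standard_normal(n_chains)
        if adjust:  # MALA: Metropolis correction
            log_alpha = log_p(prop) - log_p(x) + log_q(x, prop) - log_q(prop, x)
            accept = np.log(rng.uniform(size=n_chains)) < log_alpha
            x = np.where(accept, prop, x)
        else:       # ULA: always accept (biased stationary distribution)
            x = prop
        if step >= burn_in:
            out.append(x.copy())
    return np.concatenate(out)

s_ula = sample(adjust=False)
s_mala = sample(adjust=True)
print("ULA  var:", s_ula.var())    # ~ 1/(1 - eps/2) ~ 1.05 (biased)
print("MALA var:", s_mala.var())   # ~ 1.00 (exact target)
```

For this linear drift the ULA bias is computable in closed form (stationary variance $1/(1 - \epsilon/2)$), which makes the correction easy to see empirically.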
Underdamped Langevin Equation
Physical Background: Considering particle inertia, motion is described by both position and velocity.
Underdamped Langevin Equation (second-order SDE): writing $U$ for the potential to avoid clashing with the velocity $v$, and taking unit mass,
$$dX_t = V_t\, dt, \qquad dV_t = -\gamma V_t\, dt - \nabla U(X_t)\, dt + \sqrt{2\gamma \beta^{-1}}\, dW_t,$$
where $\gamma$ is the friction coefficient.

Phase Space Fokker-Planck Equation: The probability density $\rho(x, v, t)$ satisfies the kinetic Fokker-Planck (Kramers) equation:
$$\frac{\partial \rho}{\partial t} + v \cdot \nabla_x \rho - \nabla U \cdot \nabla_v \rho = \gamma\, \nabla_v \cdot \left( v \rho + \beta^{-1} \nabla_v \rho \right).$$

Equilibrium Distribution: The Gibbs distribution on phase space,
$$\rho_\infty(x, v) \propto \exp\left( -\beta \left[ U(x) + \tfrac{1}{2} |v|^2 \right] \right).$$
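A quick numerical sanity check (a sketch under the illustrative assumptions $\gamma = \beta = 1$ and $U(x) = x^2/2$; parameters are mine, not from the original post): simulating the underdamped equations with Euler-Maruyama, the empirical variances of position and velocity should both approach the Gibbs value $\beta^{-1} = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, beta, dt = 1.0, 1.0, 0.01
n = 4000                                   # independent particles
x = np.zeros(n)                            # positions
v = np.zeros(n)                            # velocities

# Euler-Maruyama for dx = v dt,  dv = (-gamma*v - x) dt + sqrt(2*gamma/beta) dW
for _ in range(10000):                     # total time T = 100
    noise = rng.standard_normal(n)
    x_new = x + v * dt
    v_new = v + (-gamma * v - x) * dt + np.sqrt(2 * gamma / beta * dt) * noise
    x, v = x_new, v_new

print(x.var(), v.var())   # both should be near beta^{-1} = 1
```

Position and velocity equilibrate to independent unit Gaussians, as the phase-space Gibbs distribution predicts (up to $O(\Delta t)$ discretization bias).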
Itô vs Stratonovich Integral

In stochastic differential equations, there are two standard ways to define the stochastic integral:

Itô Integral: Evaluate the integrand at the left endpoint of each subinterval:
$$\int_0^T f(X_t)\, dW_t = \lim_{n \to \infty} \sum_{i} f(X_{t_i})\, \big(W_{t_{i+1}} - W_{t_i}\big).$$

Stratonovich Integral: Evaluate at the midpoint:
$$\int_0^T f(X_t) \circ dW_t = \lim_{n \to \infty} \sum_{i} f\!\left( \frac{X_{t_i} + X_{t_{i+1}}}{2} \right) \big(W_{t_{i+1}} - W_{t_i}\big).$$

Conversion Relation: The Stratonovich SDE $dX_t = b(X_t)\, dt + \sigma(X_t) \circ dW_t$ is equivalent to the Itô SDE
$$dX_t = \left[ b(X_t) + \tfrac{1}{2}\, \sigma(X_t)\, \sigma'(X_t) \right] dt + \sigma(X_t)\, dW_t.$$
For additive noise ($\sigma$ constant), as in Langevin dynamics, the two interpretations coincide.
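The conversion formula can be seen numerically (a minimal sketch, not from the original post; the multiplicative-noise SDE $dX = \sigma X \circ dW$ with $\sigma = 0.5$ is an illustrative choice). Its equivalent Itô drift is $\tfrac{1}{2}\sigma^2 X$, so $\mathbb{E}[X_T] = e^{\sigma^2 T/2} \approx 1.133$ rather than $1$. A stochastic Heun (predictor-corrector) scheme converges to the Stratonovich solution, while plain Euler-Maruyama gives the Itô solution:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, T, n_steps, n_paths = 0.5, 1.0, 1000, 20000
dt = T / n_steps

x_ito = np.ones(n_paths)        # Euler-Maruyama  -> Ito solution
x_str = np.ones(n_paths)        # stochastic Heun -> Stratonovich solution
for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)   # shared Brownian increments
    x_ito = x_ito + sigma * x_ito * dW
    pred = x_str + sigma * x_str * dW                  # predictor step
    x_str = x_str + 0.5 * sigma * (x_str + pred) * dW  # midpoint corrector

print(x_ito.mean())   # Ito: E[X_T] = 1 (the integral is a martingale)
print(x_str.mean())   # Stratonovich: E[X_T] = exp(sigma^2 T / 2) ~ 1.133
```

The gap between the two means is exactly the Itô-Stratonovich drift correction $\tfrac{1}{2}\sigma \sigma'$ accumulated over $[0, T]$.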
Gradient Flow Interpretation of KL Divergence
Wasserstein Gradient Flow
Wasserstein Distance: For two probability measures $\mu, \nu$ with finite second moments,
$$W_2^2(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^2\, d\pi(x, y),$$
where $\Pi(\mu, \nu)$ is the set of couplings with marginals $\mu$ and $\nu$.

Wasserstein Gradient Flow: Consider an energy functional $F(\rho)$ on the space of probability densities. Its gradient flow in the Wasserstein metric is
$$\frac{\partial \rho}{\partial t} = \nabla \cdot \left( \rho\, \nabla \frac{\delta F}{\delta \rho} \right),$$
where $\frac{\delta F}{\delta \rho}$ is the first variation (functional derivative) of $F$.
KL Divergence as Energy Functional
Consider the KL divergence to the Gibbs distribution $\rho_\infty \propto e^{-\beta V}$, scaled by the temperature:
$$F(\rho) = \beta^{-1}\, \mathrm{KL}(\rho \,\|\, \rho_\infty) = \int V \rho\, dx + \beta^{-1} \int \rho \log \rho\, dx + \text{const}.$$

Functional Derivative:
$$\frac{\delta F}{\delta \rho} = V + \beta^{-1} (\log \rho + 1).$$
Equivalence of Gradient Flow and Langevin Dynamics
Theorem: The Wasserstein gradient flow of KL divergence is equivalent to the Fokker-Planck equation of Langevin dynamics.
Proof: Substituting the functional derivative of the KL energy into the gradient-flow equation gives
$$\frac{\partial \rho}{\partial t} = \nabla \cdot \left( \rho\, \nabla \left[ V + \beta^{-1} \log \rho \right] \right) = \nabla \cdot (\rho \nabla V) + \beta^{-1} \Delta \rho,$$
which is exactly the Fokker-Planck equation of overdamped Langevin dynamics. This is the celebrated result of Jordan, Kinderlehrer, and Otto (the JKO scheme).
Convergence Analysis
Theorem (Convergence): If the target distribution $\rho_\infty \propto e^{-\beta V}$ satisfies a log-Sobolev inequality with constant $\lambda > 0$ (for instance, when $V$ is $\lambda$-strongly convex), then
$$\mathrm{KL}(\rho_t \,\|\, \rho_\infty) \le e^{-2\lambda \beta^{-1} t}\, \mathrm{KL}(\rho_0 \,\|\, \rho_\infty).$$

Log-Sobolev Inequality: There exists a constant $\lambda > 0$ such that for all densities $\rho$,
$$\mathrm{KL}(\rho \,\|\, \rho_\infty) \le \frac{1}{2\lambda}\, I(\rho \,\|\, \rho_\infty), \qquad I(\rho \,\|\, \rho_\infty) = \int \rho \left| \nabla \log \frac{\rho}{\rho_\infty} \right|^2 dx,$$
where $I$ is the relative Fisher information. Combined with the entropy dissipation identity $\frac{d}{dt} \mathrm{KL} = -\beta^{-1} I$ from the H-theorem, Grönwall's inequality yields the exponential decay above.
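For the Ornstein-Uhlenbeck case $V(x) = x^2/2$ with $\beta = 1$ (so $\rho_\infty = \mathcal{N}(0,1)$, which satisfies the log-Sobolev inequality with $\lambda = 1$), the Fokker-Planck solution stays Gaussian with mean $m_t = m_0 e^{-t}$ and variance $s_t = 1 + (s_0 - 1)e^{-2t}$, so the decay bound can be checked in closed form. A small sketch (initial mean and variance are illustrative choices):

```python
import numpy as np

def kl_gauss(m, s):
    """KL( N(m, s) || N(0, 1) ), with s the variance."""
    return 0.5 * (s + m**2 - 1.0 - np.log(s))

m0, s0 = 2.0, 4.0                 # initial Gaussian, far from equilibrium
kl0 = kl_gauss(m0, s0)
lam = 1.0                         # log-Sobolev constant of N(0, 1)

for t in [0.5, 1.0, 2.0, 4.0]:
    m_t = m0 * np.exp(-t)                       # mean decays like e^{-t}
    s_t = 1.0 + (s0 - 1.0) * np.exp(-2 * t)     # variance relaxes to 1
    kl_t = kl_gauss(m_t, s_t)
    bound = np.exp(-2 * lam * t) * kl0          # theorem's upper bound
    assert kl_t <= bound + 1e-12
    print(t, round(kl_t, 4), round(bound, 4))
```

The actual KL sits strictly below the bound here because the variance mismatch decays at the faster rate $e^{-4t}$; the bound is tight for pure mean displacement.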
Connection Between Variational Inference and Langevin MCMC
Discrete-Time Perspective
Variational Inference: Optimize parameters $\theta$ of $q_\theta$ by gradient descent on the KL divergence:
$$\theta_{k+1} = \theta_k - \eta\, \nabla_\theta\, \mathrm{KL}(q_{\theta_k} \,\|\, p).$$

Continuous-Time Limit of Variational Inference: When the variational family is sufficiently expressive and the step size $\eta \to 0$, the parameter distribution $q_t$ follows the Wasserstein gradient flow of the KL divergence, i.e., the Fokker-Planck equation.

Continuous-Time Limit of Langevin MCMC: When the number of particles $N \to \infty$ and the step size tends to zero, the empirical distribution of the particles converges to the solution of the Fokker-Planck equation.

Equivalence: Both methods give the same PDE in the continuous-time limit, and are therefore equivalent at the level of density evolution.
Differences in Practical Applications
Although theoretically equivalent, in practical applications:
- Variational Inference:
  - Advantages: High computational efficiency (one forward pass per update), parallelizable
  - Disadvantages: Requires choosing a variational family, which may introduce approximation error
- Langevin MCMC:
  - Advantages: Asymptotically exact (as $t \to \infty$), no need to choose a variational family
  - Disadvantages: Requires long runtimes, difficult to parallelize
- Hybrid Methods: Combine the advantages of both, such as Variational Langevin Dynamics.
Stein Variational Gradient Descent (SVGD)
Motivation: Particle Variational Inference
Traditional variational inference requires choosing a parameterized distribution family (e.g., Gaussian), which limits expressiveness. Stein Variational Gradient Descent (SVGD) instead approximates the target distribution with a particle system $\{x_i\}_{i=1}^N$, updated deterministically so that the empirical distribution flows toward the target.
Foundations of Stein Method
Stein Operator: For a smooth function $\phi$ and a target density $p$, define
$$\mathcal{A}_p \phi(x) = \phi(x)\, \nabla_x \log p(x)^\top + \nabla_x \phi(x).$$

Stein Identity: If $\phi$ is sufficiently regular and decays appropriately, then
$$\mathbb{E}_{x \sim p}\left[ \mathcal{A}_p \phi(x) \right] = 0.$$
SVGD Algorithm
Objective: Minimize the (kernelized) Stein discrepancy between the particle distribution $q$ and the target $p$, which equals the steepest-descent rate of $\mathrm{KL}(q \,\|\, p)$ under perturbations $x \mapsto x + \epsilon\, \phi(x)$.

Key Insight: When $\phi$ is restricted to the unit ball of an RKHS with kernel $k$, the optimal direction function is:
$$\phi^*(x) = \mathbb{E}_{x' \sim q}\left[ k(x', x)\, \nabla_{x'} \log p(x') + \nabla_{x'} k(x', x) \right].$$
The first term pulls particles toward high-density regions of $p$; the second acts as a repulsive force that prevents particle collapse.
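The optimal direction translates directly into code. Below is a minimal NumPy sketch of an SVGD update (not the original experiment code; the RBF kernel with median-heuristic bandwidth, a standard Gaussian target, and all step sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def svgd_step(x, score, step=0.1):
    """One SVGD update for 1-D particles x, given the target score function."""
    n = x.shape[0]
    diff = x[:, None] - x[None, :]               # diff[j, i] = x_j - x_i
    h = np.median(diff**2) / np.log(n + 1)       # median-heuristic bandwidth
    h = max(h, 1e-8)
    K = np.exp(-diff**2 / h)                     # K[j, i] = k(x_j, x_i)
    grad_K = -2.0 * diff / h * K                 # d/dx_j k(x_j, x_i)
    # phi*(x_i) = mean over j of [ k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K * score(x)[:, None] + grad_K).mean(axis=0)
    return x + step * phi

score = lambda x: -x                             # target N(0, 1): grad log p(x) = -x
x = rng.normal(-3.0, 0.5, size=200)              # particles start far off-target
for _ in range(2000):
    x = svgd_step(x, score)

print(x.mean(), x.var())   # should approach 0 and ~1
```

The attractive term drags the particle cloud toward the mode at $0$, while the repulsive term spreads the particles so that the empirical variance approaches that of the target rather than collapsing to a point.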
PDE Interpretation of SVGD
Continuous-Time Limit: When the number of particles $N \to \infty$ and the step size tends to zero, the particle density $\rho(x, t)$ satisfies the mean-field PDE
$$\frac{\partial \rho}{\partial t} = -\nabla \cdot \big( \rho\, \phi^*_\rho \big), \qquad \phi^*_\rho(x) = \int k(x', x)\, \nabla_{x'} \log p(x')\, \rho(x')\, dx' + \int \nabla_{x'} k(x', x)\, \rho(x')\, dx'.$$

Expanding: Integrating the second term by parts gives
$$\phi^*_\rho(x) = -\int k(x', x)\, \rho(x')\, \nabla_{x'} \log \frac{\rho(x')}{p(x')}\, dx',$$
so SVGD is a kernel-smoothed version of the Wasserstein gradient flow of $\mathrm{KL}(\rho \,\|\, p)$, i.e., a kernelized Fokker-Planck evolution.
SVGD vs Langevin MCMC
Similarities:
- Both use particle systems
- Both evolve distributions toward the target distribution

Differences:
- Langevin MCMC: Each particle evolves independently, exploring through random noise
- SVGD: Particles interact through kernel functions, forming repulsive forces that prevent particle clustering

Advantages of SVGD:
- Typically converges faster (thanks to particle interactions)
- Requires no random noise; updates are deterministic
- Handles multimodal distributions well (particles automatically disperse to different modes)
Experimental Validation
Experiment 1: One-Dimensional Fokker-Planck Evolution Visualization
Objective: Visualize solutions of the Fokker-Planck equation, showing how probability density evolves to equilibrium distribution.
Setup: The potential function and discretization parameters are specified in experiment1_fokker_planck.py, which contains the complete code.

Key Steps:
1. Spatial discretization: Discretize the domain on a uniform grid (method of lines)
2. Time integration: Use scipy.integrate.odeint to solve the resulting ODE system
3. Boundary conditions: Use zero-flux (reflecting) boundaries
4. Visualization: Show the probability density at different times, and the KL divergence to equilibrium over time

Expected Results: The initially concentrated distribution spreads out and converges to the Gibbs equilibrium distribution, with the KL divergence decreasing monotonically, consistent with the H-theorem.
Experiment 2: Langevin Dynamics Sampling (Multimodal Distribution)
Objective: Sample from multimodal distributions using Langevin dynamics, comparing effects of different temperature parameters.
Setup: The multimodal target distribution and all parameters are specified in experiment2_langevin_sampling.py, which contains the complete code.

Key Steps:
1. Initialization: 1000 particles, initial positions concentrated near the origin
2. Langevin update: Use the Euler-Maruyama method with a small step size
3. Temperature comparison: Run the same dynamics at several temperatures $\beta^{-1}$

Expected Results: At low temperature, particles concentrate sharply inside the modes but mix slowly across energy barriers; at high temperature, particles cross barriers easily but the modes become blurred.
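As a rough sketch of what such an experiment might look like (the actual target and parameters live in experiment2_langevin_sampling.py; here a symmetric double-well $V(x) = (x^2 - 1)^2$, step size, and temperatures are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def grad_V(x):                       # V(x) = (x^2 - 1)^2, wells at x = +/-1
    return 4.0 * x * (x**2 - 1.0)

def langevin(temperature, n_particles=1000, n_steps=5000, eps=0.01):
    """Unadjusted Langevin dynamics at temperature beta^{-1}."""
    x = 0.1 * rng.standard_normal(n_particles)      # start near the origin
    for _ in range(n_steps):
        noise = rng.standard_normal(n_particles)
        x = x - eps * grad_V(x) + np.sqrt(2.0 * eps * temperature) * noise
    return x

results = {}
for T in [0.05, 0.5]:
    results[T] = langevin(T)
    frac_right = (results[T] > 0).mean()
    print(f"T={T}: fraction in right well = {frac_right:.2f}, "
          f"spread = {results[T].std():.2f}")
```

With a symmetric start the two wells end up roughly equally populated at either temperature, but at low temperature each particle stays pinned near whichever minimum it first reached, while at high temperature particles hop between wells freely.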
Experiment 3: VI vs MCMC Convergence Comparison
Objective: Compare convergence rates of variational inference and Langevin MCMC.
Setup: A fixed target distribution; complete code available in experiment3_vi_vs_mcmc.py.

Key Steps:
1. Variational Inference: Use the Adam optimizer to minimize the KL divergence over the variational parameters
2. Langevin MCMC: Run Langevin chains on the same target
3. Comparison: Track the distance to the target against iterations for both methods

Expected Results:
- Variational Inference: Fast convergence (but may have residual approximation error)
- MCMC: Slower convergence (but asymptotically exact)
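The variational-inference half can be sketched in a few lines when both $q$ and the target are Gaussian, since $\mathrm{KL}(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(\mu^*, 1)) = \tfrac{1}{2}[\sigma^2 + (\mu - \mu^*)^2 - 1 - \ln \sigma^2]$ is available in closed form (the target $\mathcal{N}(2, 1)$, plain gradient descent instead of Adam, and the learning rate are illustrative choices, not the experiment's own settings):

```python
import numpy as np

mu_star = 2.0                         # illustrative target: N(2, 1)

def kl(mu, sigma):
    """KL( N(mu, sigma^2) || N(mu_star, 1) ) in closed form."""
    return 0.5 * (sigma**2 + (mu - mu_star)**2 - 1.0 - 2.0 * np.log(sigma))

# Plain gradient descent on the variational parameters (mu, sigma)
mu, sigma, lr = -1.0, 0.3, 0.05
history = [kl(mu, sigma)]
for _ in range(500):
    grad_mu = mu - mu_star            # d KL / d mu
    grad_sigma = sigma - 1.0 / sigma  # d KL / d sigma
    mu -= lr * grad_mu
    sigma -= lr * grad_sigma
    history.append(kl(mu, sigma))

print(mu, sigma, history[-1])   # converges to mu = 2, sigma = 1, KL ~ 0
```

Because the variational family here contains the target, the KL drives all the way to zero; with a mismatched family it would plateau at the irreducible approximation error, which is exactly the gap the experiment is designed to expose.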
Experiment 4: SVGD Particle Trajectories and Density Estimation
Objective: Visualize SVGD particle evolution process, showing how particles disperse to different modes of the target distribution.
Setup: The multimodal target distribution and all parameters are specified in experiment4_svgd.py, which contains the complete code.

Key Steps:
1. Initialization: 100 particles, initial positions concentrated near the origin
2. SVGD update: Use an RBF kernel with a small step size
3. Visualization: Plot particle trajectories and kernel density estimates at several iterations

Expected Results: Particles start concentrated near the origin, then spread out under the kernel repulsion and settle across the different modes of the target distribution, with the empirical density matching the target.
Series Summary: Unified Perspective of PDE and Machine Learning
Through systematic exploration across eight articles, we have established a complete theoretical framework for understanding machine learning from the perspective of partial differential equations. Let us review the core ideas of this series:
Core Theme Review
Variational Principles and Neural Network Optimization: Viewing neural network training as gradient flows in Wasserstein space, revealing the continuous dynamics nature of optimization processes.
Physics-Informed Neural Networks: Transforming PDE solving problems into optimization problems, demonstrating applications of deep learning methods in scientific computing.
Neural Operator Theory: Learning infinite-dimensional mappings, unifying learning problems on function spaces.
PDE Nature of Diffusion Models: Revealing the mathematical foundations of generative models, establishing connections between SDEs, Fokker-Planck equations, and Score Matching.
Continuous Normalizing Flows: Connecting flow models with ordinary differential equations, providing another perspective on generative models.
PDE Theory of Variational Inference: Unifying variational inference and MCMC methods, demonstrating continuous-time dynamics of probabilistic inference.
Unified Mathematical Framework
All these themes revolve around a core idea: machine learning problems can be viewed as solving partial differential equations.
- Optimization problems → Wasserstein gradient flows
- Sampling problems → Fokker-Planck equations
- Function learning → Operator equations
- Generative models → Diffusion equations
Future Directions
More Complex PDE Structures: Explore applications of fractional PDEs, stochastic PDEs, etc. in machine learning.
Improvements in Numerical Methods: Apply classical PDE numerical methods (finite elements, spectral methods) to machine learning.
Theoretical Analysis: Establish more rigorous convergence and stability theories.
Practical Applications: Apply PDE perspectives to practical problems such as scientific computing, financial modeling, etc.
Conclusion
The intersection of partial differential equations and machine learning is rapidly developing. Through this series of articles, we hope readers can:
- Understand Mathematical Essence: Deeply understand the theoretical foundations of machine learning from a PDE perspective
- Master Practical Tools: Learn to use PDE methods to solve practical problems
- Explore Frontier Directions: Understand current research hotspots and future development directions
The beauty of mathematics lies in unity. When we re-examine machine learning using the language of partial differential equations, seemingly different methods reveal profound connections. This unity not only deepens our understanding but also provides powerful tools for designing new algorithms.
References
Langevin Diffusion Variational Inference. arXiv:2208.07743
Variational Inference as Parametric Langevin Dynamics. ICML 2020
Fokker-Planck Transport. arXiv:2410.18993
Liu, Q., & Wang, D. (2016). Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. NeurIPS 2016. arXiv:1608.04471
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859-877.
Ambrosio, L., Gigli, N., & Savaré, G. (2008). Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Birkhäuser.
Villani, C. (2009). Optimal Transport: Old and New. Springer.
- Post title: PDE and Machine Learning (4): Variational Inference and Fokker-Planck Equation
- Post author: Chen Kai
- Create time: 2022-02-05 09:45:00
- Post link: https://www.chenk.top/pde-ml-4-variational-inference-fokker-planck/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless otherwise stated.