PDE and Machine Learning (4): Variational Inference and Fokker-Planck Equation
Chen Kai

Probabilistic inference is one of the core problems in machine learning. Given observed data, we wish to infer the posterior distribution of latent variables or sample from complex high-dimensional distributions. Traditional methods fall into two main categories: Variational Inference (VI) approximates the posterior by optimizing a variational lower bound, while Markov Chain Monte Carlo (MCMC) samples by constructing Markov chains. These seemingly different approaches reveal profound unity when viewed through the lens of partial differential equations.

When we use Langevin dynamics for MCMC sampling, particle motion in a potential field is described by stochastic differential equations, with probability density evolution governed by the Fokker-Planck equation. When we optimize the variational lower bound using gradient descent, the evolution of the distribution in Wasserstein space can similarly be viewed as a gradient flow of an energy functional. More remarkably, the gradient flow that minimizes KL divergence solves exactly the Fokker-Planck equation: variational inference and Langevin MCMC are completely equivalent in the continuous-time limit. This PDE perspective not only reveals the mathematical essence of probabilistic inference but also provides a unified theoretical framework for designing new inference algorithms such as Stein Variational Gradient Descent.

This article systematically establishes this theoretical framework. We begin with the Fokker-Planck equation, showing how to formalize the probability density evolution of stochastic processes as partial differential equations. We then delve into Langevin dynamics, discussing overdamped and underdamped cases, and the distinction between Itô and Stratonovich integrals. Next, we establish the gradient flow interpretation of KL divergence, proving the equivalence between variational inference and Langevin MCMC. Finally, we focus on advanced methods like Stein Variational Gradient Descent, demonstrating how to solve variational inference problems using particle systems, and validate theoretical predictions through four complete experiments.

Introduction: PDE Perspective on Probabilistic Inference

Fundamental Problems in Probabilistic Inference

Bayesian Inference: Given observed data $\mathcal{D}$ and a prior distribution $p(\theta)$, we wish to compute the posterior distribution
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta.$$
For complex models, the posterior is typically intractable (the integral in the denominator is difficult to compute), necessitating approximation methods.

Variational Inference: Approximate the posterior using a simple distribution family $\mathcal{Q}$ by minimizing KL divergence:
$$q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid \mathcal{D})\big).$$
Equivalently, maximize the Evidence Lower BOund (ELBO):
$$\mathrm{ELBO}(q) = \mathbb{E}_{q}[\log p(\mathcal{D}, \theta)] - \mathbb{E}_{q}[\log q(\theta)] = \log p(\mathcal{D}) - \mathrm{KL}\big(q \,\|\, p(\theta \mid \mathcal{D})\big).$$

MCMC Methods: Construct Markov chains whose stationary distribution is the posterior. Common methods include Metropolis-Hastings, Gibbs sampling, and Langevin dynamics.
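As a sanity check on the ELBO identity above, the sketch below uses a hypothetical conjugate toy model ($\theta \sim \mathcal{N}(0,1)$, $x \mid \theta \sim \mathcal{N}(\theta,1)$), chosen because its posterior and evidence are available in closed form; when $q$ equals the true posterior, a Monte Carlo ELBO estimate matches the log evidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conjugate model: theta ~ N(0,1), x | theta ~ N(theta,1).
# With one observation x, the posterior is N(x/2, 1/2), evidence is N(x; 0, 2).
x = 1.0
post_mean, post_var = x / 2, 0.5
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def log_joint(theta):
    # log p(x, theta) = log N(theta; 0, 1) + log N(x; theta, 1)
    return (-0.5 * np.log(2 * np.pi) - theta**2 / 2
            - 0.5 * np.log(2 * np.pi) - (x - theta)**2 / 2)

def elbo(mean, var, n=200_000):
    # Monte Carlo ELBO: E_q[log p(x, theta)] - E_q[log q(theta)]
    theta = rng.normal(mean, np.sqrt(var), size=n)
    log_q = -0.5 * np.log(2 * np.pi * var) - (theta - mean)**2 / (2 * var)
    return np.mean(log_joint(theta) - log_q)

# When q equals the true posterior, the ELBO is tight: ELBO = log p(x).
print(elbo(post_mean, post_var), log_evidence)
```

For any other $q$ the estimate falls strictly below `log_evidence`, with the gap equal to $\mathrm{KL}(q \,\|\, p(\theta \mid x))$.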

Unified Framework from PDE Perspective

From a PDE perspective, both methods involve evolution on the space of probability measures:

  1. Langevin MCMC: Particles $X_t$ follow the stochastic differential equation
$$dX_t = -\nabla U(X_t)\, dt + \sqrt{2\beta^{-1}}\, dW_t,$$
where $U$ is the potential function and $\beta^{-1}$ is the temperature parameter. The evolution of the probability density $p(x, t)$ is governed by the Fokker-Planck equation:
$$\frac{\partial p}{\partial t} = \nabla \cdot (p \nabla U) + \beta^{-1} \Delta p.$$

  2. Variational Inference: The distribution $q_t$ evolves in Wasserstein space, minimizing KL divergence. It can be shown that this evolution process is also described by the Fokker-Planck equation, differing only in initial and boundary conditions.

  3. Stein Variational Gradient Descent: Approximate distributions using a particle system $\{x_i\}_{i=1}^n$, where each particle follows
$$\frac{dx_i}{dt} = \frac{1}{n} \sum_{j=1}^{n} \big[ k(x_j, x_i)\, \nabla \log p(x_j) + \nabla_{x_j} k(x_j, x_i) \big],$$
where $k$ is a kernel function. This can be viewed as a discretization of the Fokker-Planck equation under a finite-particle approximation.

Article Structure

This article is organized as follows:

  1. Fokker-Planck Equation and Probability Density Evolution: Derive the Fokker-Planck equation from stochastic differential equations, discussing its physical meaning and mathematical properties.

  2. Langevin Dynamics: Detailed discussion of overdamped and underdamped Langevin equations, the distinction between Itô and Stratonovich integrals, and numerical solution methods.

  3. Gradient Flow Interpretation of KL Divergence: Prove that KL divergence minimization is equivalent to solving the Fokker-Planck equation, establishing connections between variational inference and Langevin MCMC.

  4. Equivalence of Variational Inference and Langevin MCMC: Prove the equivalence of both methods in the continuous-time limit.

  5. Stein Variational Gradient Descent: Introduce the SVGD method, demonstrating how to solve variational inference problems using particle systems.

  6. Experimental Validation: Validate theoretical predictions through four experiments.

Fokker-Planck Equation and Probability Density Evolution

From Stochastic Differential Equations to Fokker-Planck Equation

Consider a general stochastic differential equation (SDE):
$$dX_t = b(X_t, t)\, dt + \sigma(X_t, t)\, dW_t,$$
where $b \in \mathbb{R}^d$ is the drift term, $\sigma \in \mathbb{R}^{d \times m}$ is the diffusion term, and $W_t$ is an $m$-dimensional standard Brownian motion.

The Fokker-Planck equation (also known as the Kolmogorov forward equation) describes the evolution of the probability density function $p(x, t)$:
$$\frac{\partial p}{\partial t} = -\sum_{i=1}^{d} \frac{\partial}{\partial x_i}\big[ b_i\, p \big] + \frac{1}{2} \sum_{i,j=1}^{d} \frac{\partial^2}{\partial x_i \partial x_j}\big[ D_{ij}\, p \big],$$
where $D = \sigma \sigma^{\top}$ is the diffusion matrix.

Vector form:
$$\frac{\partial p}{\partial t} = -\nabla \cdot (b\, p) + \frac{1}{2} \nabla \cdot \nabla \cdot (D\, p),$$
where $\nabla \cdot$ denotes the divergence operator.

Derivation: From SDE to Fokker-Planck Equation

Approach: Consider an arbitrary smooth test function $\varphi$, and compute the time derivative of $\mathbb{E}[\varphi(X_t)]$.

By Itô's lemma:
$$d\varphi(X_t) = \nabla \varphi \cdot dX_t + \frac{1}{2}\, dX_t^{\top}\, \nabla^2 \varphi\, dX_t.$$
Substituting the SDE:
$$d\varphi(X_t) = \Big[ b \cdot \nabla \varphi + \frac{1}{2} \sum_{i,j} D_{ij} \frac{\partial^2 \varphi}{\partial x_i \partial x_j} \Big] dt + \nabla \varphi \cdot \sigma\, dW_t.$$
Taking expectation (the martingale term vanishes):
$$\frac{d}{dt} \mathbb{E}[\varphi(X_t)] = \mathbb{E}\Big[ b \cdot \nabla \varphi + \frac{1}{2} \sum_{i,j} D_{ij} \frac{\partial^2 \varphi}{\partial x_i \partial x_j} \Big].$$
On the other hand:
$$\frac{d}{dt} \mathbb{E}[\varphi(X_t)] = \frac{d}{dt} \int \varphi(x)\, p(x, t)\, dx = \int \varphi(x)\, \frac{\partial p}{\partial t}\, dx.$$
Therefore:
$$\int \varphi\, \frac{\partial p}{\partial t}\, dx = \int \Big[ b \cdot \nabla \varphi + \frac{1}{2} \sum_{i,j} D_{ij} \frac{\partial^2 \varphi}{\partial x_i \partial x_j} \Big]\, p\, dx.$$
Applying integration by parts to the right-hand side (assuming boundary terms vanish):
$$\int \varphi\, \frac{\partial p}{\partial t}\, dx = \int \varphi \Big[ -\nabla \cdot (b\, p) + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\big( D_{ij}\, p \big) \Big]\, dx.$$
Since $\varphi$ is arbitrary, we obtain the Fokker-Planck equation.
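The derivation can be checked numerically. For the Ornstein-Uhlenbeck process $dX = -X\, dt + \sqrt{2}\, dW$ (an illustrative choice, not tied to the article's experiments), the Fokker-Planck equation implies the variance ODE $dv/dt = -2v + 2$, so $v(t) = 1 + (v_0 - 1)e^{-2t}$; a plain Euler-Maruyama simulation reproduces this prediction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ornstein-Uhlenbeck process: dX = -X dt + sqrt(2) dW  (U(x) = x^2/2, beta = 1).
# Its Fokker-Planck equation dp/dt = d/dx(x p) + d^2 p/dx^2 gives the variance
# ODE dv/dt = -2v + 2, hence v(t) = 1 + (v0 - 1) exp(-2t).
n, dt, T = 200_000, 1e-3, 1.0
x = np.zeros(n)                        # v0 = 0
for _ in range(int(T / dt)):
    x += -x * dt + np.sqrt(2 * dt) * rng.standard_normal(n)

v_pred = 1 - np.exp(-2 * T)            # FP prediction at t = 1
print(x.var(), v_pred)                 # the two should agree closely
```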

Special Case: Fokker-Planck Equation for Langevin Dynamics

For overdamped Langevin dynamics:
$$dX_t = -\nabla U(X_t)\, dt + \sqrt{2\beta^{-1}}\, dW_t,$$
where $U$ is the potential function and $\beta^{-1}$ is the temperature parameter.

The corresponding Fokker-Planck equation is:
$$\frac{\partial p}{\partial t} = \nabla \cdot (p \nabla U) + \beta^{-1} \Delta p.$$

Equilibrium Distribution: When $\partial p / \partial t = 0$, if an equilibrium distribution $p_\infty$ exists, then the probability flux vanishes, so:
$$\nabla \cdot \big( p_\infty \nabla U + \beta^{-1} \nabla p_\infty \big) = 0.$$
It can be verified that the Gibbs distribution
$$p_\infty(x) = \frac{1}{Z} e^{-\beta U(x)}, \qquad Z = \int e^{-\beta U(x)}\, dx,$$
is the equilibrium distribution. Verification: $\nabla p_\infty = -\beta\, p_\infty \nabla U$, therefore
$$p_\infty \nabla U + \beta^{-1} \nabla p_\infty = p_\infty \nabla U - p_\infty \nabla U = 0.$$

Properties of the Fokker-Planck Equation

Probability Conservation: If the initial distribution satisfies $\int p(x, 0)\, dx = 1$, then for any $t > 0$ we have $\int p(x, t)\, dx = 1$.

Proof:
$$\frac{d}{dt} \int p\, dx = \int \frac{\partial p}{\partial t}\, dx = \int \nabla \cdot \big( p \nabla U + \beta^{-1} \nabla p \big)\, dx = 0$$
(by the divergence theorem, boundary terms vanish).

Entropy Increase Principle: For pure diffusion processes, the differential entropy $H(p) = -\int p \log p\, dx$ (the expected negative log-density) increases over time:
$$\frac{dH}{dt} = -\int \frac{\partial p}{\partial t}\, (\log p + 1)\, dx.$$
Substituting the Fokker-Planck equation, it can be shown that $dH/dt \geq 0$ (when the diffusion matrix is positive definite).

H-Theorem: For Langevin dynamics, if the equilibrium distribution is the Gibbs distribution $p_\infty \propto e^{-\beta U}$, then the relative entropy (KL divergence) decreases monotonically:
$$\frac{d}{dt}\, \mathrm{KL}(p_t \,\|\, p_\infty) = -\beta^{-1} \int p_t\, \Big| \nabla \log \frac{p_t}{p_\infty} \Big|^2\, dx \leq 0.$$
This guarantees convergence of the distribution to equilibrium.
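Both properties can be illustrated with a small finite-difference sketch: a conservative (flux-form) discretization with no-flux boundaries preserves total mass to machine precision, and the KL divergence to the Gibbs distribution decreases. The double-well potential $U(x) = (x^2 - 1)^2$ and all grid settings below are illustrative choices:

```python
import numpy as np

# Finite-difference sketch of the 1D Fokker-Planck equation
#   dp/dt = d/dx (p U') + d^2 p / dx^2          (beta = 1)
# for an illustrative double-well potential U(x) = (x^2 - 1)^2.
x = np.linspace(-2, 2, 201)
dx = x[1] - x[0]
U = (x**2 - 1) ** 2
p_inf = np.exp(-U); p_inf /= p_inf.sum() * dx            # Gibbs distribution

p = np.exp(-0.5 * (x / 0.3) ** 2); p /= p.sum() * dx     # initial density
xh = 0.5 * (x[:-1] + x[1:])                              # half-grid points
dU_h = 4 * xh * (xh**2 - 1)                              # U'(x) at half points

def kl(q):
    m = q > 1e-300
    return np.sum(q[m] * np.log(q[m] / p_inf[m])) * dx

dt, kl0 = 1e-4, kl(p)
for _ in range(10_000):                                  # integrate to t = 1
    # flux form: dp/dt = dG/dx with G = p U' + dp/dx, G = 0 at the boundaries
    G = 0.5 * (p[:-1] + p[1:]) * dU_h + (p[1:] - p[:-1]) / dx
    flux = np.concatenate(([0.0], G, [0.0]))
    p = p + dt * (flux[1:] - flux[:-1]) / dx

print(p.sum() * dx, kl0, kl(p))   # mass stays 1; KL has decreased
```

Because the update telescopes over the fluxes, total mass is conserved exactly (up to rounding), mirroring the continuous proof above.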

Langevin Dynamics: Overdamped and Underdamped

Overdamped Langevin Equation

Physical Background: In a viscous medium, particle inertia can be neglected, and motion is entirely driven by friction and random forces.

Overdamped Langevin Equation:
$$\gamma\, dX_t = -\nabla U(X_t)\, dt + \sqrt{2\gamma \beta^{-1}}\, dW_t,$$
where $\gamma$ is the friction coefficient. Typically setting $\gamma = 1$ gives the standard form:
$$dX_t = -\nabla U(X_t)\, dt + \sqrt{2\beta^{-1}}\, dW_t.$$

Corresponding Fokker-Planck Equation:
$$\frac{\partial p}{\partial t} = \nabla \cdot (p \nabla U) + \beta^{-1} \Delta p.$$

Numerical Solution: Euler-Maruyama method (first-order):
$$x_{k+1} = x_k - \eta\, \nabla U(x_k) + \sqrt{2 \beta^{-1} \eta}\, \xi_k,$$
where $\eta$ is the step size and $\xi_k \sim \mathcal{N}(0, I)$.

Improved Method: Metropolis-adjusted Langevin algorithm (MALA) adds a Metropolis accept-reject step after each iteration, ensuring exact sampling.
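A minimal MALA sketch, using a standard Gaussian target $U(x) = x^2/2$ as an illustrative choice: the Euler-Maruyama step serves as the proposal, and the Metropolis correction removes its discretization bias:

```python
import numpy as np

rng = np.random.default_rng(2)

# MALA sketch for p(x) ∝ exp(-U(x)) with U(x) = x^2/2 (illustrative target).
def U(x):  return 0.5 * x**2
def dU(x): return x

def mala(n_steps=20_000, eta=0.5, x0=5.0):
    x, samples = x0, []
    # log density of the Langevin proposal q(a | b) = N(b - eta U'(b), 2 eta)
    logq = lambda a, b: -((a - b + eta * dU(b)) ** 2) / (4 * eta)
    for k in range(n_steps):
        # Euler-Maruyama proposal: one step of overdamped Langevin dynamics
        y = x - eta * dU(x) + np.sqrt(2 * eta) * rng.standard_normal()
        # Metropolis accept-reject step
        log_alpha = (U(x) - U(y)) + logq(x, y) - logq(y, x)
        if np.log(rng.random()) < log_alpha:
            x = y
        if k >= 1000:                      # discard burn-in
            samples.append(x)
    return np.array(samples)

s = mala()
print(s.mean(), s.var())                   # ≈ 0 and ≈ 1 for the N(0,1) target
```

Dropping the accept-reject step recovers the unadjusted Euler-Maruyama sampler, whose stationary distribution carries an $O(\eta)$ bias.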

Underdamped Langevin Equation

Physical Background: Considering particle inertia, motion is described by both position and velocity.

Underdamped Langevin Equation (second-order SDE):
$$dX_t = V_t\, dt, \qquad dV_t = -\nabla U(X_t)\, dt - \gamma V_t\, dt + \sqrt{2\gamma \beta^{-1}}\, dW_t,$$
where $V_t$ is the velocity and $\gamma$ is the friction coefficient.

Phase Space Fokker-Planck Equation: The probability density $p(x, v, t)$ evolves in $(x, v)$ phase space:
$$\frac{\partial p}{\partial t} = -v \cdot \nabla_x p + \nabla_v \cdot \big( p\, (\nabla_x U + \gamma v) \big) + \gamma \beta^{-1} \Delta_v p,$$
where $\nabla_x$ and $\nabla_v$ denote gradients with respect to position and velocity, respectively.

Equilibrium Distribution: The Gibbs distribution
$$p_\infty(x, v) \propto \exp\Big( -\beta\, \Big( U(x) + \frac{|v|^2}{2} \Big) \Big).$$

Numerical Solution: Hamiltonian Monte Carlo (HMC) can be viewed as a discretization of underdamped Langevin dynamics.
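A simple (semi-implicit) Euler scheme for the underdamped equations, with illustrative choices $U(x) = x^2/2$, $\gamma = 1$, $\beta = 1$, equilibrates to the phase-space Gibbs distribution, so both the position and velocity marginals end up close to $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Underdamped Langevin:  dX = V dt,  dV = (-U'(X) - gamma V) dt + sqrt(2 gamma/beta) dW,
# with U(x) = x^2/2, gamma = 1, beta = 1 (illustrative choices). The phase-space
# Gibbs distribution is p(x, v) ∝ exp(-(U(x) + v^2/2)): x and v both ~ N(0, 1).
n, dt, gamma = 100_000, 1e-2, 1.0
x = np.full(n, 3.0)                        # start displaced from equilibrium
v = np.zeros(n)
for _ in range(2_000):                     # integrate to t = 20
    x = x + v * dt
    v = (v + (-x - gamma * v) * dt
         + np.sqrt(2 * gamma * dt) * rng.standard_normal(n))
print(x.mean(), x.var(), v.var())          # ≈ 0, ≈ 1, ≈ 1
```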

Itô vs Stratonovich Integral

In stochastic differential equations, there are two ways to define integration:

Itô Integral:
$$\int_0^t \sigma(X_s)\, dW_s = \lim_{n \to \infty} \sum_{k} \sigma(X_{t_k})\, \big( W_{t_{k+1}} - W_{t_k} \big).$$

Stratonovich Integral:
$$\int_0^t \sigma(X_s) \circ dW_s = \lim_{n \to \infty} \sum_{k} \sigma\Big( X_{\frac{t_k + t_{k+1}}{2}} \Big)\, \big( W_{t_{k+1}} - W_{t_k} \big).$$

Distinction:
  • Itô integral: integrand evaluated at the left endpoint of each interval; does not satisfy the classical chain rule
  • Stratonovich integral: integrand evaluated at the midpoint of each interval; satisfies the classical chain rule (as in ordinary calculus)

Conversion Relation: The Stratonovich SDE
$$dX_t = b(X_t)\, dt + \sigma(X_t) \circ dW_t$$
is equivalent to the Itô form
$$dX_t = \Big[ b(X_t) + \frac{1}{2}\, \sigma(X_t)\, \sigma'(X_t) \Big]\, dt + \sigma(X_t)\, dW_t.$$

Application in Langevin Dynamics: We typically use the Itô integral because:
  1. It is mathematically simpler (martingale property)
  2. Numerical methods are more direct (Euler-Maruyama)
  3. The Fokker-Planck equation takes a more concise form

Note that for Langevin dynamics with a constant diffusion coefficient, the correction term vanishes and the two conventions coincide.
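The correction term can be seen numerically. For $dX = X\, dW$ (so $\sigma(x) = x$, zero drift, an illustrative choice), the Itô solution is a martingale, $\mathbb{E}[X_t] = X_0$, while the Stratonovich convention adds the drift $\frac{1}{2}\sigma\sigma' = x/2$, giving $\mathbb{E}[X_t] = X_0\, e^{t/2}$; simulating the Stratonovich case through its equivalent Itô form makes the difference visible:

```python
import numpy as np

rng = np.random.default_rng(4)

# dX = X dW with sigma(x) = x and no drift.
# Itô:          E[X_t] = X_0                 (martingale)
# Stratonovich: equivalent Itô drift is (1/2) sigma sigma' = x/2,
#               so E[X_t] = X_0 * exp(t/2).
n, dt, T, x0 = 200_000, 1e-3, 1.0, 1.0
ito = np.full(n, x0)
strat = np.full(n, x0)
for _ in range(int(T / dt)):
    dW = np.sqrt(dt) * rng.standard_normal(n)
    ito = ito + ito * dW                            # Euler-Maruyama (Itô)
    strat = strat + 0.5 * strat * dt + strat * dW   # Itô form + correction drift
print(ito.mean(), strat.mean())    # ≈ 1.0 and ≈ exp(0.5) ≈ 1.65
```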

Gradient Flow Interpretation of KL Divergence

Wasserstein Gradient Flow

Wasserstein Distance: For two probability measures $\mu, \nu$, the $2$-Wasserstein distance is defined as:
$$W_2(\mu, \nu)^2 = \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^2\, d\pi(x, y),$$
where $\Pi(\mu, \nu)$ is the set of all coupling measures with marginals $\mu$ and $\nu$.

Wasserstein Gradient Flow: Consider an energy functional $F[p]$; its Wasserstein gradient flow is:
$$\frac{\partial p}{\partial t} = \nabla \cdot \Big( p\, \nabla \frac{\delta F}{\delta p} \Big),$$
where $\delta F / \delta p$ is the functional derivative.

KL Divergence as Energy Functional

Consider the KL divergence:
$$F[p] = \mathrm{KL}(p \,\|\, p_\infty) = \int p \log \frac{p}{p_\infty}\, dx,$$
where $p_\infty \propto e^{-U}$ is the target distribution.

Functional Derivative:
$$\frac{\delta F}{\delta p} = \log \frac{p}{p_\infty} + 1.$$
Wasserstein Gradient Flow:
$$\frac{\partial p}{\partial t} = \nabla \cdot \Big( p\, \nabla \log \frac{p}{p_\infty} \Big).$$
Expanding:
$$\frac{\partial p}{\partial t} = \Delta p - \nabla \cdot (p\, \nabla \log p_\infty).$$
Since $p_\infty \propto e^{-U}$, we have $\nabla \log p_\infty = -\nabla U$, therefore:
$$\frac{\partial p}{\partial t} = \nabla \cdot (p \nabla U) + \Delta p.$$
This is exactly the Fokker-Planck equation (with $\beta = 1$)!
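This expansion can be verified symbolically in one dimension; a quick check with sympy, using generic functions $p$ and $U$:

```python
import sympy as sp

# Symbolic check (1D) that the Wasserstein gradient flow of KL,
#   dp/dt = d/dx [ p * d/dx (log p + U) ],
# expands to the Fokker-Planck equation  dp/dt = d/dx (p U') + p''.
x = sp.symbols('x')
p = sp.Function('p')(x)
U = sp.Function('U')(x)

gradient_flow = sp.diff(p * sp.diff(sp.log(p) + U, x), x)
fokker_planck = sp.diff(p * sp.diff(U, x), x) + sp.diff(p, x, 2)
print(sp.simplify(gradient_flow - fokker_planck))  # 0
```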

Equivalence of Gradient Flow and Langevin Dynamics

Theorem: The Wasserstein gradient flow of KL divergence is equivalent to the Fokker-Planck equation of Langevin dynamics.

Proof: We have seen that the gradient flow of KL divergence gives:
$$\frac{\partial p}{\partial t} = \nabla \cdot (p \nabla U) + \Delta p,$$
while the Fokker-Planck equation for the Langevin dynamics $dX_t = -\nabla U(X_t)\, dt + \sqrt{2}\, dW_t$ (i.e. $\beta = 1$) is:
$$\frac{\partial p}{\partial t} = \nabla \cdot (p \nabla U) + \Delta p.$$
They are identical.

Physical Meaning:
  • Variational Inference: In Wasserstein space, the distribution evolves along the negative gradient of KL divergence, gradually approaching the target distribution
  • Langevin MCMC: Particles move in a potential field, and the probability density likewise evolves toward the target distribution
  • Unity: Both methods are completely equivalent in the continuous-time limit

Convergence Analysis

Theorem (Convergence): If the target distribution $p_\infty$ satisfies the log-Sobolev inequality, then the KL divergence converges at an exponential rate:
$$\mathrm{KL}(p_t \,\|\, p_\infty) \leq e^{-2\lambda t}\, \mathrm{KL}(p_0 \,\|\, p_\infty),$$
where $\lambda$ is the log-Sobolev constant.

Log-Sobolev Inequality: There exists a constant $\lambda > 0$ such that for any smooth function $f$:
$$\mathrm{Ent}_{p_\infty}(f^2) \leq \frac{2}{\lambda} \int |\nabla f|^2\, p_\infty\, dx,$$
where $\mathrm{Ent}_{p_\infty}(g) = \int g \log g\, p_\infty\, dx - \Big( \int g\, p_\infty\, dx \Big) \log \Big( \int g\, p_\infty\, dx \Big)$ is the entropy functional.
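For the OU process ($U(x) = x^2/2$, $\beta = 1$), whose Gaussian equilibrium has log-Sobolev constant $\lambda = 1$, the flow maps Gaussians to Gaussians and everything is available in closed form, so the exponential bound can be checked directly (the initial parameters below are illustrative):

```python
import numpy as np

# Closed-form check of exponential KL decay for the OU process
# (U(x) = x^2/2, beta = 1, log-Sobolev constant lambda = 1).
# A Gaussian N(m0, v0) stays Gaussian under the flow:
#   m(t) = m0 e^{-t},   v(t) = 1 + (v0 - 1) e^{-2t},
# and KL(N(m, v) || N(0, 1)) = (v - 1 - log v + m^2) / 2.
def kl(m, v):
    return 0.5 * (v - 1 - np.log(v) + m**2)

m0, v0 = 2.0, 0.25
for t in [0.0, 0.5, 1.0, 2.0]:
    m = m0 * np.exp(-t)
    v = 1 + (v0 - 1) * np.exp(-2 * t)
    print(t, kl(m, v), np.exp(-2 * t) * kl(m0, v0))  # KL(t) <= e^{-2t} KL(0)
```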

Connection Between Variational Inference and Langevin MCMC

Discrete-Time Perspective

Variational Inference: Optimize parameters $\theta$ so that $q_\theta$ approximates the target $p$:
$$\theta_{k+1} = \theta_k - \eta\, \nabla_\theta\, \mathrm{KL}(q_{\theta_k} \,\|\, p).$$

Langevin MCMC: Sample particles $\{x_i\}$ so that the empirical distribution approximates $p$:
$$x_{k+1} = x_k - \eta\, \nabla U(x_k) + \sqrt{2\eta}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).$$

Continuous-Time Limit

Continuous-Time Limit of Variational Inference: As the step size tends to zero, the distribution $q_t$ evolves continuously; if the natural (Wasserstein) gradient is used, then:
$$\frac{\partial q_t}{\partial t} = \nabla \cdot \Big( q_t\, \nabla \log \frac{q_t}{p} \Big).$$
This is exactly the Wasserstein gradient flow of KL divergence.

Continuous-Time Limit of Langevin MCMC: As the number of particles $N \to \infty$ and the step size tends to zero, the evolution of the empirical distribution is described by the Fokker-Planck equation.

Equivalence: Both methods give the same PDE in the continuous-time limit, hence equivalent.

Differences in Practical Applications

Although theoretically equivalent, in practical applications:

  1. Variational Inference:
    • Advantages: High computational efficiency (one forward pass), parallelizable
    • Disadvantages: Requires choosing variational family, may have approximation error
  2. Langevin MCMC:
    • Advantages: Asymptotically exact (as the number of iterations tends to infinity), no need to choose a variational family
    • Disadvantages: Requires long runtime, difficult to parallelize
  3. Hybrid Methods: Combine advantages of both, such as Variational Langevin Dynamics.

Stein Variational Gradient Descent (SVGD)

Motivation: Particle Variational Inference

Traditional variational inference requires choosing a parameterized distribution family (e.g., Gaussian), which limits expressiveness. Stein Variational Gradient Descent (SVGD) approximates distributions using a particle system $\{x_i\}_{i=1}^n$ without explicit parameterization.

Foundations of Stein Method

Stein Operator: For a smooth (vector-valued) test function $\phi$, define the Stein operator:
$$\mathcal{A}_p \phi(x) = \phi(x)\, \nabla \log p(x)^{\top} + \nabla \phi(x),$$
where $p$ is the target distribution.

Stein Identity: If $x \sim p$, then for any function $\phi$ in a suitable class:
$$\mathbb{E}_{x \sim p}\big[ \mathcal{A}_p \phi(x) \big] = 0.$$

Stein Discrepancy:
$$\mathbb{S}(q, p) = \sup_{\phi \in \mathcal{F}}\, \mathbb{E}_{x \sim q}\big[ \operatorname{trace}\big( \mathcal{A}_p \phi(x) \big) \big],$$
where $\mathcal{F}$ is a function class (typically taken as the unit ball of a Reproducing Kernel Hilbert Space, RKHS).

SVGD Algorithm

Objective: Minimize the Stein discrepancy $\mathbb{S}(q, p)$, where $q$ is the empirical distribution of the particle system.

Key Insight: In an RKHS with kernel $k$, the optimal direction function is:
$$\phi^*(\cdot) = \mathbb{E}_{x \sim q}\big[ k(x, \cdot)\, \nabla_x \log p(x) + \nabla_x k(x, \cdot) \big].$$

SVGD Update Rule:
$$x_i \leftarrow x_i + \epsilon\, \hat{\phi}^*(x_i), \qquad \hat{\phi}^*(x) = \frac{1}{n} \sum_{j=1}^{n} \big[ k(x_j, x)\, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x) \big],$$
where $\epsilon$ is the step size.
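The update rule can be sketched in a few lines for a 1D standard Gaussian target (so $\nabla \log p(x) = -x$), using an RBF kernel with the median-heuristic bandwidth and an RMSProp-style adaptive step as in common implementations; the particle count, step size, and iteration count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# SVGD sketch for a 1D standard Gaussian target: grad log p(x) = -x.
def svgd_direction(x):
    diff = x[:, None] - x[None, :]                   # diff[i, j] = x_i - x_j
    h = np.median(np.abs(diff)) ** 2 / np.log(len(x)) + 1e-8  # median heuristic
    K = np.exp(-diff**2 / h)                         # RBF kernel matrix
    grad_logp = -x                                   # target N(0, 1)
    # phi*(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + d k(x_j, x_i)/d x_j ]
    return (K @ grad_logp + (2.0 / h) * (K * diff).sum(axis=1)) / len(x)

x = rng.normal(-2.0, 1.0, size=100)                  # particles start off-target
hist, eps = np.zeros_like(x), 0.05
for _ in range(2000):
    phi = svgd_direction(x)
    hist = 0.9 * hist + 0.1 * phi**2                 # RMSProp-style accumulator
    x = x + eps * phi / (1e-8 + np.sqrt(hist))
print(x.mean(), x.std())                             # ≈ 0 and ≈ 1
```

The first term in `phi` drives particles toward high-density regions; the second (kernel-gradient) term is the repulsive force discussed below.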

PDE Interpretation of SVGD

Continuous-Time Limit: As the number of particles $n \to \infty$, the evolution of the empirical distribution $q_t$ is described by the following PDE:
$$\frac{\partial q_t}{\partial t} = -\nabla \cdot \big( q_t\, \phi^*_{q_t} \big), \qquad \phi^*_{q_t}(x) = \int \big[ k(y, x)\, \nabla \log p(y) + \nabla_y k(y, x) \big]\, q_t(y)\, dy.$$

Expanding the definition of $\phi^*_{q_t}$, this can be viewed as a non-local Fokker-Planck equation, where the kernel function $k$ introduces interactions between particles.

SVGD vs Langevin MCMC

Similarities:
  • Both use particle systems
  • Both evolve distributions toward the target distribution

Differences:
  • Langevin MCMC: Each particle evolves independently, exploring through random noise
  • SVGD: Particles interact through kernel functions, forming repulsive forces that prevent particle clustering

Advantages of SVGD:
  • Typically converges faster (thanks to particle interactions)
  • Requires no random noise: updates are deterministic
  • Can handle multimodal distributions (particles automatically disperse to different modes)

Experimental Validation

Experiment 1: One-Dimensional Fokker-Planck Evolution Visualization

Objective: Visualize solutions of the Fokker-Planck equation, showing how probability density evolves to equilibrium distribution.

Setup:
  • Potential function: a double-well (bimodal) potential $U(x)$
  • Initial distribution: a narrow Gaussian concentrated near the origin
  • Equilibrium distribution: the Gibbs distribution $p_\infty \propto e^{-\beta U}$

Method: Numerically solve the Fokker-Planck equation (finite difference method). Complete code is available in experiment1_fokker_planck.py.

Key Steps:
  1. Spatial discretization: Discretize the spatial domain into 200 points
  2. Time integration: Use scipy.integrate.odeint to solve the resulting ODE system
  3. Boundary conditions: Zero-flux (reflecting) boundaries
  4. Visualization: Show the probability density at different times, and the KL divergence over time

Expected Results:
  • The initial distribution is concentrated near the origin
  • Over time, the distribution diffuses and mass moves toward the two wells
  • It finally converges to the equilibrium distribution (bimodal structure)

Experiment 2: Langevin Dynamics Sampling (Multimodal Distribution)

Objective: Sample from multimodal distributions using Langevin dynamics, comparing effects of different temperature parameters.

Setup:
  • Target distribution: a multimodal Gibbs distribution $p(x) \propto e^{-\beta U(x)}$
  • Sampling method: Euler-Maruyama discretization
  • Temperature parameters: several values of $\beta$

Complete code is available in experiment2_langevin_sampling.py.

Key Steps:
  1. Initialization: 1000 particles, initial positions concentrated near the origin
  2. Langevin update: Euler-Maruyama method with step size $\eta$
  3. Burn-in period: the first 1000 steps are discarded from the statistics
  4. Visualization: Compare sampling distributions under different temperature parameters, and trajectories of individual particles

Expected Results:
  • Low temperature (large $\beta$): samples concentrate near the peaks
  • High temperature (small $\beta$): samples are more dispersed, exploring the whole distribution better
  • In all cases, the sampling distribution should approximate the true distribution

Experiment 3: VI vs MCMC Convergence Comparison

Objective: Compare convergence rates of variational inference and Langevin MCMC.

Setup:
  • Target distribution: a fixed target density $p$ (see the experiment code)
  • Variational Inference: Gaussian family $q_{\mu, \sigma} = \mathcal{N}(\mu, \sigma^2)$
  • MCMC: Langevin dynamics

Complete code available in experiment3_vi_vs_mcmc.py.

Key Steps:
  1. Variational Inference: Adam optimizer minimizing the KL divergence over the parameters $(\mu, \sigma)$
  2. Langevin MCMC: 100 particles, computing the KL divergence every 10 steps
  3. KL divergence computation: VI uses Monte Carlo estimation; MCMC uses KDE to estimate the empirical distribution
  4. Visualization: Compare convergence rates and final distributions of both methods

Expected Results:
  • Variational Inference: fast convergence (but possibly with approximation error)
  • MCMC: slower convergence (but asymptotically exact)
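As a reduced version of this experiment, when both the target and the variational family are Gaussian, the KL divergence has a closed form and plain gradient descent on $(\mu, \sigma)$ recovers the target parameters exactly; the target $\mathcal{N}(2, 0.5^2)$ below is a hypothetical stand-in for the experiment's own target:

```python
import numpy as np

# VI sketch: fit q = N(mu, s^2) to the target p = N(2, 0.5^2) by gradient
# descent on the closed-form KL divergence between Gaussians.
mu_p, s_p = 2.0, 0.5

def kl(mu, s):
    # KL(N(mu, s^2) || N(mu_p, s_p^2))
    return np.log(s_p / s) + (s**2 + (mu - mu_p) ** 2) / (2 * s_p**2) - 0.5

mu, s, lr = 0.0, 2.0, 0.05
for _ in range(500):
    g_mu = (mu - mu_p) / s_p**2          # dKL/dmu
    g_s = -1 / s + s / s_p**2            # dKL/ds
    mu -= lr * g_mu
    s -= lr * g_s
print(mu, s, kl(mu, s))                  # → 2.0, 0.5, ≈ 0
```

Because the family contains the target, the KL divergence reaches zero here; with a mismatched family, gradient descent would instead converge to the best Gaussian approximation, which is the "approximation error" this experiment measures.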

Experiment 4: SVGD Particle Trajectories and Density Estimation

Objective: Visualize SVGD particle evolution process, showing how particles disperse to different modes of the target distribution.

Setup:
  • Target distribution: a bimodal target distribution
  • Initial particles: concentrated near the origin
  • Kernel function: RBF kernel with bandwidth $h$

Complete code is available in experiment4_svgd.py.

Key Steps:
  1. Initialization: 100 particles, initial positions concentrated near the origin
  2. SVGD update: RBF kernel with step size $\epsilon$
  3. Visualization: Show the evolution of the particle distribution, particle trajectories, inter-particle distances, and KL divergence convergence

Expected Results:
  • Initially: particles concentrated near the origin
  • During evolution: particles gradually disperse to the two peaks
  • Finally: the particle distribution approximates the target distribution (bimodal structure)

Series Summary: Unified Perspective of PDE and Machine Learning

Through the systematic exploration in this series, we have established a complete theoretical framework for understanding machine learning from the perspective of partial differential equations. Let us review the core ideas of the series:

Core Theme Review

  1. Variational Principles and Neural Network Optimization: Viewing neural network training as gradient flows in Wasserstein space, revealing the continuous dynamics nature of optimization processes.

  2. Physics-Informed Neural Networks: Transforming PDE solving problems into optimization problems, demonstrating applications of deep learning methods in scientific computing.

  3. Neural Operator Theory: Learning infinite-dimensional mappings, unifying learning problems on function spaces.

  4. PDE Nature of Diffusion Models: Revealing the mathematical foundations of generative models, establishing connections between SDEs, Fokker-Planck equations, and Score Matching.

  5. Continuous Normalizing Flows: Connecting flow models with ordinary differential equations, providing another perspective on generative models.

  6. PDE Theory of Variational Inference: Unifying variational inference and MCMC methods, demonstrating continuous-time dynamics of probabilistic inference.

Unified Mathematical Framework

All these themes revolve around a core idea: machine learning problems can be viewed as solving partial differential equations.

  • Optimization problems → Wasserstein gradient flows
  • Sampling problems → Fokker-Planck equations
  • Function learning → Operator equations
  • Generative models → Diffusion equations

Future Directions

  1. More Complex PDE Structures: Explore applications of fractional PDEs, stochastic PDEs, etc. in machine learning.

  2. Improvements in Numerical Methods: Apply classical PDE numerical methods (finite elements, spectral methods) to machine learning.

  3. Theoretical Analysis: Establish stricter convergence and stability theories.

  4. Practical Applications: Apply PDE perspectives to practical problems such as scientific computing, financial modeling, etc.

Conclusion

The intersection of partial differential equations and machine learning is rapidly developing. Through this series of articles, we hope readers can:

  1. Understand Mathematical Essence: Deeply understand the theoretical foundations of machine learning from a PDE perspective
  2. Master Practical Tools: Learn to use PDE methods to solve practical problems
  3. Explore Frontier Directions: Understand current research hotspots and future development directions

The beauty of mathematics lies in unity. When we re-examine machine learning using the language of partial differential equations, seemingly different methods reveal profound connections. This unity not only deepens our understanding but also provides powerful tools for designing new algorithms.



  • Post title: PDE and Machine Learning (4): Variational Inference and Fokker-Planck Equation
  • Post author: Chen Kai
  • Create time: 2022-02-05 09:45:00
  • Post link: https://www.chenk.top/pde-ml-4-variational-inference-fokker-planck/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.