PDE and Machine Learning (7): Diffusion Models and Score Matching
Chen Kai

The core task of generative models is to sample from data distributions. Traditional approaches like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) achieve this through explicit encoder-decoder structures or adversarial training. However, since 2020, diffusion models have rapidly emerged as the dominant paradigm in generative AI, celebrated for their exceptional generation quality and training stability. From DALL·E 2 to Stable Diffusion, from image generation to text-to-image synthesis, diffusion models are reshaping our understanding of generative AI.

Yet beneath the success of diffusion models lies a profound mathematical structure: they are essentially numerical solvers for partial differential equations (PDEs). When we add Gaussian noise to data, we are actually solving a forward diffusion process whose probability density evolution is governed by the Fokker-Planck equation; when we learn denoising models, we are actually learning Score functions whose gradients guide the reverse diffusion process; when we use DDPM or DDIM sampling, we are actually numerically solving stochastic or deterministic ordinary differential equations. This PDE perspective not only reveals the mathematical essence of diffusion models but also provides a unified framework for understanding their convergence, designing new sampling algorithms, and extending to conditional generation tasks.

This article systematically establishes this theoretical framework. We begin with classical heat equations, introducing fundamental tools such as Fick's law, Gaussian kernels, and Fourier transforms. We then introduce stochastic differential equations (SDEs) and the Fokker-Planck equation, demonstrating how diffusion processes can be formalized as probability density evolution. Next, we focus on Score-Based generative models, deriving Score Matching objective functions and establishing connections between Langevin dynamics and sampling processes. Finally, we delve into DDPM and DDIM, showing how they serve as discretization schemes for SDEs/ODEs, and validate theoretical predictions through four complete experiments.

Heat Equation and Diffusion Processes: From Fick's Law to Gaussian Kernels

Fick's Law and the Diffusion Equation

Diffusion phenomena are ubiquitous in nature: a drop of ink gradually disperses in clear water, heat flows from high-temperature regions to low-temperature regions, and molecules move under concentration gradients. The mathematical descriptions of all these processes reduce to the diffusion equation, also known as the heat equation.

Fick's First Law (1855): The diffusion flux $J$ is proportional to the concentration gradient and points in the opposite direction:

$$J = -D \nabla c,$$

where $D$ is the diffusion coefficient, $c(x, t)$ is the concentration (or probability density) at position $x$ and time $t$, and $\nabla$ is the gradient operator.

Mass Conservation: Consider a spatial region $\Omega$. The rate of change of the total mass equals the incoming flux:

$$\frac{d}{dt}\int_\Omega c \, dV = -\oint_{\partial\Omega} J \cdot n \, dS,$$

where $n$ is the outward normal vector. Applying the divergence theorem:

$$\frac{d}{dt}\int_\Omega c \, dV = -\int_\Omega \nabla \cdot J \, dV.$$

Since the region $\Omega$ is arbitrary, we obtain the continuity equation:

$$\frac{\partial c}{\partial t} + \nabla \cdot J = 0.$$

Substituting Fick's law, we obtain the diffusion equation (one-dimensional form):

$$\frac{\partial c}{\partial t} = D \frac{\partial^2 c}{\partial x^2}.$$

Higher-dimensional form:

$$\frac{\partial c}{\partial t} = D \Delta c,$$

where $\Delta = \sum_{i=1}^{d} \frac{\partial^2}{\partial x_i^2}$ is the Laplacian operator.

Gaussian Kernels: Fundamental Solutions of the Diffusion Equation

The diffusion equation has analytical solutions, and its fundamental solution is the Gaussian kernel.

One-dimensional case: Consider the initial condition $c(x, 0) = \delta(x)$ (Dirac delta function). The solution to the diffusion equation is:

$$c(x, t) = \frac{1}{\sqrt{4\pi D t}} \exp\!\left(-\frac{x^2}{4 D t}\right).$$

This is a Gaussian distribution with mean 0 and variance $2Dt$. The variance grows linearly with time, reflecting the "smoothing" effect of the diffusion process.

Higher-dimensional case: In $\mathbb{R}^d$, the fundamental solution is:

$$G(x, t) = \frac{1}{(4\pi D t)^{d/2}} \exp\!\left(-\frac{\|x\|^2}{4 D t}\right).$$

General initial conditions: For an arbitrary initial distribution $c_0(x)$, the solution is given by convolution:

$$c(x, t) = (G(\cdot, t) * c_0)(x),$$

where $G$ is the Gaussian kernel and $*$ denotes convolution.

Physical interpretation:
- Sharp distributions at the initial time (such as a Dirac delta) gradually "diffuse" over time, with increasing variance
- Any initial distribution can be viewed as a linear combination of Dirac deltas, so the solution is the convolution of the initial distribution with the Gaussian kernel
- As $t \to \infty$, the distribution tends to uniform (if the domain is bounded) or to zero pointwise (if the domain is unbounded)

Fourier Transform and Spectral Methods

The diffusion equation has a concise form in the Fourier domain, providing powerful tools for theoretical analysis and numerical solution.

Fourier transform: Define the Fourier transform of a function $f(x)$ as:

$$\hat{f}(k) = \int_{-\infty}^{\infty} f(x)\, e^{-ikx} \, dx.$$

The inverse transform is:

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(k)\, e^{ikx} \, dk.$$

Key properties:
- Derivative property: $\widehat{f'}(k) = ik\, \hat{f}(k)$
- Convolution property: $\widehat{f * g}(k) = \hat{f}(k)\, \hat{g}(k)$
- Laplacian operator: $\widehat{\Delta f}(k) = -\|k\|^2 \hat{f}(k)$

Fourier form of the diffusion equation: Taking the Fourier transform of both sides of $\partial_t c = D \Delta c$:

$$\frac{\partial \hat{c}(k, t)}{\partial t} = -D \|k\|^2 \, \hat{c}(k, t).$$

This is an ordinary differential equation for each fixed $k$, with solution:

$$\hat{c}(k, t) = \hat{c}(k, 0)\, e^{-D \|k\|^2 t}.$$

Physical interpretation:
- High-frequency components (large $\|k\|$) decay faster, reflecting the "low-pass filtering" property of diffusion processes
- The decay rate $D\|k\|^2$ grows quadratically with frequency, indicating rapid elimination of high-frequency noise
- This explains why diffusion processes "smooth" initial distributions

Numerical solution: In the Fourier domain, solutions to the diffusion equation can be written explicitly, providing a foundation for efficient numerical methods. For periodic boundary conditions, the Fast Fourier Transform (FFT) achieves $O(N \log N)$ complexity.
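As a concrete check, here is a minimal NumPy sketch of the spectral solver (grid size, domain length, and diffusion coefficient are illustrative choices, not taken from the text): a single multiplication in Fourier space advances each mode exactly, and the variance of a narrow Gaussian grows by $2Dt$ as predicted.

```python
import numpy as np

def heat_step_fft(c, dt, D, L):
    """Advance the periodic heat equation c_t = D * c_xx by time dt.

    Each Fourier mode decays independently:
    c_hat(k, t + dt) = c_hat(k, t) * exp(-D * k**2 * dt).
    """
    N = len(c)
    k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)  # angular wavenumbers
    return np.real(np.fft.ifft(np.fft.fft(c) * np.exp(-D * k**2 * dt)))

# Narrow Gaussian initial condition; after time t its variance should grow
# from sigma0^2 to sigma0^2 + 2*D*t (free-space approximation, valid while
# the profile stays far from the periodic boundary).
L_dom, N, D, t = 20.0, 512, 0.5, 1.0
x = np.linspace(0, L_dom, N, endpoint=False)
dx = x[1] - x[0]
sigma0 = 0.3
c = np.exp(-(x - L_dom / 2) ** 2 / (2 * sigma0**2))
c /= c.sum() * dx  # normalize to a probability density

c_t = heat_step_fft(c, dt=t, D=D, L=L_dom)  # per-mode decay is exact, so one step suffices

var = np.sum((x - L_dom / 2) ** 2 * c_t) * dx
print(var, sigma0**2 + 2 * D * t)  # the two values nearly agree
```

Note the step is exact for any `dt`, which is why spectral methods are attractive for periodic problems.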

Stochastic Differential Equations and the Fokker-Planck Equation

Itô Integral and Stochastic Differential Equations

Diffusion processes can be naturally described using Stochastic Differential Equations (SDEs). This provides a rigorous framework for understanding the randomness in diffusion models.

Brownian Motion: Standard Brownian motion $W_t$ satisfies:
- $W_0 = 0$
- Independent increments: $W_t - W_s$ is independent of $\{W_u : u \le s\}$
- Normal increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$
- Continuous paths that are almost nowhere differentiable

Itô Integral: For an adapted process $f_t$, the Itô integral is defined as:

$$\int_0^T f_t \, dW_t = \lim_{n \to \infty} \sum_{i=0}^{n-1} f_{t_i} \big(W_{t_{i+1}} - W_{t_i}\big),$$

where $0 = t_0 < t_1 < \cdots < t_n = T$ is a partition of the interval.

Key properties:
- Zero mean: $\mathbb{E}\!\left[\int_0^T f_t \, dW_t\right] = 0$
- Itô isometry: $\mathbb{E}\!\left[\left(\int_0^T f_t \, dW_t\right)^2\right] = \mathbb{E}\!\left[\int_0^T f_t^2 \, dt\right]$
- Martingale property: $M_t = \int_0^t f_s \, dW_s$ is a martingale

Stochastic Differential Equation: The general form of an SDE is:

$$dX_t = b(X_t, t) \, dt + \sigma(X_t, t) \, dW_t,$$

where $b$ is the drift term, $\sigma$ is the diffusion coefficient, and $W_t$ is standard Brownian motion.

Forward diffusion SDE: In diffusion models, the forward process is typically written as:

$$dx = f(x, t) \, dt + g(t) \, dW_t,$$

where $f$ is the drift function and $g$ is the diffusion coefficient function. Common choices include:
- Variance Preserving (VP): $f(x, t) = -\frac{1}{2}\beta(t)\, x$, $\quad g(t) = \sqrt{\beta(t)}$
- Variance Exploding (VE): $f(x, t) = 0$, $\quad g(t) = \sqrt{\dfrac{d[\sigma^2(t)]}{dt}}$
- Sub-Variance Preserving (sub-VP): intermediate between VP and VE
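The VP forward process is easy to simulate directly. Below is a hedged Euler-Maruyama sketch (the linear $\beta(t)$ schedule and step counts are illustrative, not from the text): starting from data concentrated at a single point, the terminal distribution should approach $\mathcal{N}(0, 1)$, which is what "variance preserving" promises.

```python
import numpy as np

rng = np.random.default_rng(0)

def vp_forward(x0, beta, T=1.0, n_steps=1000):
    """Euler-Maruyama discretization of the VP SDE
    dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dW."""
    dt = T / n_steps
    x = x0.copy()
    for i in range(n_steps):
        t = i * dt
        b = beta(t)
        x += -0.5 * b * x * dt + np.sqrt(b * dt) * rng.standard_normal(x.shape)
    return x

# Illustrative linear schedule beta(t) on [0, 1].
beta = lambda t: 0.1 + 19.9 * t

x0 = np.full(50_000, 3.0)   # "data" concentrated at a single point
xT = vp_forward(x0, beta, T=1.0)

print(xT.mean(), xT.var())  # mean -> ~0, variance -> ~1
```

With this schedule the data mean is shrunk by $e^{-\frac12\int_0^1 \beta(t)\,dt} \approx 0.007$, so the starting point is almost completely forgotten by $t = 1$.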

Fokker-Planck Equation: Evolution of Probability Density

The Fokker-Planck equation describes how the probability density of an SDE solution evolves over time, serving as a bridge between stochastic processes and PDEs.

Theorem (Fokker-Planck Equation): Let $X_t$ satisfy the SDE:

$$dX_t = b(X_t, t) \, dt + \sigma(X_t, t) \, dW_t.$$

Then its probability density $p(x, t)$ satisfies the Fokker-Planck equation (also known as the Kolmogorov forward equation):

$$\frac{\partial p}{\partial t} = -\nabla \cdot \big(b(x, t)\, p\big) + \frac{1}{2} \Delta \big(\sigma^2(x, t)\, p\big).$$

One-dimensional case:

$$\frac{\partial p}{\partial t} = -\frac{\partial}{\partial x}\big(b(x, t)\, p\big) + \frac{1}{2} \frac{\partial^2}{\partial x^2}\big(\sigma^2(x, t)\, p\big).$$

Proof sketch: For any smooth test function $\varphi$, apply Itô's lemma:

$$d\varphi(X_t) = \left(b\, \varphi'(X_t) + \frac{1}{2}\sigma^2 \varphi''(X_t)\right) dt + \sigma\, \varphi'(X_t) \, dW_t.$$

Taking expectations and using the zero mean of the Itô integral:

$$\frac{d}{dt}\mathbb{E}[\varphi(X_t)] = \mathbb{E}\!\left[b\, \varphi'(X_t) + \frac{1}{2}\sigma^2 \varphi''(X_t)\right].$$

On the other hand:

$$\frac{d}{dt}\mathbb{E}[\varphi(X_t)] = \int \varphi(x)\, \frac{\partial p(x, t)}{\partial t} \, dx.$$

Therefore:

$$\int \varphi\, \frac{\partial p}{\partial t} \, dx = \int \left(b\, \varphi' + \frac{1}{2}\sigma^2 \varphi''\right) p \, dx.$$

Integrating by parts and using the arbitrariness of $\varphi$, we obtain the Fokker-Planck equation.

Special case: diffusion equation: When $b = 0$ and $\sigma = \sqrt{2D}$ (constant), the Fokker-Planck equation reduces to the diffusion equation:

$$\frac{\partial p}{\partial t} = D\, \frac{\partial^2 p}{\partial x^2}.$$

This verifies the equivalence between the diffusion equation and stochastic processes.

Kolmogorov Backward Equation

The Kolmogorov backward equation describes the evolution of conditional expectations and plays a key role in the sampling process of diffusion models.

Theorem (Kolmogorov Backward Equation): Let $X_t$ satisfy an SDE. For any function $\phi$, the conditional expectation $u(x, t) = \mathbb{E}[\phi(X_T) \mid X_t = x]$ satisfies:

$$\frac{\partial u}{\partial t} + b(x, t)\, \frac{\partial u}{\partial x} + \frac{1}{2}\sigma^2(x, t)\, \frac{\partial^2 u}{\partial x^2} = 0,$$

with terminal condition $u(x, T) = \phi(x)$.

Physical interpretation: - The forward equation describes probability density evolution forward from the initial time - The backward equation describes conditional expectation evolution backward from the terminal time - They are connected through the Feynman-Kac formula

Application in diffusion models: The backward equation is used to derive the reverse diffusion SDE, which is the theoretical foundation of Score-Based generative models.

Score-Based Generative Models: From Score Functions to Langevin Dynamics

Score Function: Logarithmic Gradient of Probability Density

Definition (Score Function): Let $p(x)$ be a probability density function. Its Score function is defined as:

$$s(x) = \nabla_x \log p(x).$$

Key properties:
- The Score function is independent of normalization constants: if $p(x) = \frac{1}{Z} e^{-E(x)}$, then $s(x) = -\nabla_x E(x)$
- Score function of a Gaussian distribution: if $p = \mathcal{N}(\mu, \Sigma)$, then $s(x) = -\Sigma^{-1}(x - \mu)$
- The expectation of the Score function is zero: $\mathbb{E}_{p}[s(x)] = 0$

Why learn the Score function?
1. No normalization needed: only the unnormalized energy function is required, avoiding computation of partition functions
2. Guided sampling: the Score function points in the direction of increasing probability density, enabling Langevin dynamics sampling
3. Connection to PDEs: the Score function is closely related to the probability flow in the Fokker-Planck equation
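The Gaussian case above can be verified numerically. A short NumPy sketch (the mean and covariance values are illustrative): the closed-form score $s(x) = -\Sigma^{-1}(x - \mu)$ averages to zero under samples from $p$, as the third property claims.

```python
import numpy as np

rng = np.random.default_rng(1)

# Score of a 2-D Gaussian N(mu, Sigma): s(x) = -Sigma^{-1} (x - mu).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def score(x):
    # Row-wise score for a batch of points (Sigma_inv is symmetric).
    return -(x - mu) @ Sigma_inv.T

# Property check: E_p[s(X)] = 0 when X ~ p.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
print(score(X).mean(axis=0))  # ~ [0, 0] up to Monte Carlo error
```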

Score Matching: Learning Score Functions

Objective function: Given a data distribution $p_{\text{data}}(x)$, we wish to learn a neural network $s_\theta(x)$ to approximate the true Score function $\nabla_x \log p_{\text{data}}(x)$.

Explicit Score Matching (ESM): Minimize:

$$J_{\mathrm{ESM}}(\theta) = \frac{1}{2}\, \mathbb{E}_{p_{\text{data}}}\!\left[\big\| s_\theta(x) - \nabla_x \log p_{\text{data}}(x) \big\|^2\right].$$

But $\nabla_x \log p_{\text{data}}(x)$ is unknown and cannot be directly computed.

Implicit Score Matching (ISM): Through integration by parts, it can be shown that, up to a constant independent of $\theta$, the ESM objective equals:

$$J_{\mathrm{ISM}}(\theta) = \mathbb{E}_{p_{\text{data}}}\!\left[\frac{1}{2}\big\| s_\theta(x) \big\|^2 + \operatorname{tr}\big(\nabla_x s_\theta(x)\big)\right],$$

where $\operatorname{tr}(\nabla_x s_\theta(x))$ is the trace of the Jacobian matrix of the Score network.

Denoising Score Matching (DSM): Add noise to the data, $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and learn the Score function of the noisy distribution:

$$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}},\, \epsilon}\!\left[\big\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|^2\right], \qquad \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}.$$

Sliced Score Matching (SSM): To reduce computational cost, one can match only the projection of the Score function onto random directions $v$:

$$J_{\mathrm{SSM}}(\theta) = \mathbb{E}_{v}\, \mathbb{E}_{p_{\text{data}}}\!\left[v^\top \nabla_x s_\theta(x)\, v + \frac{1}{2}\big(v^\top s_\theta(x)\big)^2\right].$$

Langevin Dynamics: Sampling with Score Functions

Langevin dynamics: Given a Score function $s(x) = \nabla_x \log p(x)$, one can sample using the following SDE, whose stationary distribution is $p(x)$:

$$dX_t = s(X_t) \, dt + \sqrt{2} \, dW_t.$$

Discretization (Langevin MCMC):

$$x_{k+1} = x_k + \eta\, s(x_k) + \sqrt{2\eta}\, z_k, \qquad z_k \sim \mathcal{N}(0, I),$$

where $\eta$ is the step size.

Theoretical guarantee: Under mild conditions, as $k \to \infty$ and the step size $\eta \to 0$, the distribution of $x_k$ converges to $p(x)$.

Geometric intuition:
- The Score function $\nabla_x \log p(x)$ points in the direction of increasing probability density
- The deterministic term $\eta\, s(x_k)$ pushes samples toward high-probability regions
- The stochastic term $\sqrt{2\eta}\, z_k$ provides exploration, avoiding local optima
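The Langevin MCMC update can be sketched in a few lines. Here the target is a Gaussian whose score is known in closed form (the target parameters and step size are illustrative); in practice $s$ would be a trained network $s_\theta$.

```python
import numpy as np

rng = np.random.default_rng(2)

def langevin_sample(score, x0, eta=0.01, n_steps=2000):
    """Langevin MCMC: x_{k+1} = x_k + eta * s(x_k) + sqrt(2*eta) * z_k."""
    x = x0.copy()
    for _ in range(n_steps):
        x += eta * score(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
    return x

# Target N(2, 0.5^2): score s(x) = -(x - 2) / 0.25, known analytically here.
score = lambda x: -(x - 2.0) / 0.25

samples = langevin_sample(score, x0=np.zeros(20_000))
print(samples.mean(), samples.std())  # ~ 2.0 and ~ 0.5
```

Note the small $O(\eta)$ bias of the unadjusted scheme: the empirical standard deviation lands slightly above $0.5$, consistent with the "$\eta \to 0$" requirement in the guarantee above.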

Forward Diffusion and Reverse Sampling

Forward diffusion SDE: Starting from the data distribution $p_0 = p_{\text{data}}$, evolve to the prior distribution $p_T$ (typically standard Gaussian) by adding noise:

$$dx = f(x, t) \, dt + g(t) \, dW_t.$$

Reverse diffusion SDE: According to Anderson's theorem, the reverse process satisfies:

$$dx = \left[f(x, t) - g^2(t)\, \nabla_x \log p_t(x)\right] dt + g(t) \, d\bar{W}_t,$$

where $\bar{W}_t$ is reverse-time Brownian motion and $\nabla_x \log p_t(x)$ is the Score function at time $t$.

Key insight:
- The drift term of the reverse SDE contains the Score function $\nabla_x \log p_t(x)$
- If we can learn the Score function $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ at each time $t$, we can generate samples by numerically solving the reverse SDE
- This is the core idea of Score-Based generative models

DDPM and DDIM: A Discretization Perspective

DDPM: Discretization of Forward and Reverse Processes

Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) is one of the earliest successful diffusion models, discretizing the continuous diffusion process into finite steps.

Forward process: Define discrete time steps $t = 1, \dots, T$. The forward process is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big),$$

where $\{\beta_t\}_{t=1}^{T}$ is a predefined noise schedule.

Key properties:
- One can analytically compute:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big),$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
- When $T \to \infty$ and $\bar{\alpha}_T \to 0$, the distribution of $x_T$ tends to standard Gaussian.
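The closed-form marginal makes the forward process a one-liner. A NumPy sketch (the linear $\beta$ range is the one commonly used in the DDPM paper, included here as an illustrative choice): with $T = 1000$ steps, $\bar{\alpha}_T$ is tiny and $x_T$ is essentially standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear beta schedule from 1e-4 to 0.02 over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.full(100_000, 5.0)   # data concentrated at a single point
xT = q_sample(x0, T - 1)

print(alpha_bars[-1])        # ~ 0: the data is almost fully forgotten
print(xT.mean(), xT.var())   # ~ 0 and ~ 1
```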

Reverse process: Learn the reverse distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big).$$

Training objective: Maximizing the lower bound of the log-likelihood is equivalent to minimizing the (simplified) loss:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2\right],$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, and $\epsilon_\theta$ is a neural network predicting the noise.

Connection to Score Matching: It can be shown that the DDPM loss is equivalent to weighted Score Matching, with

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}},$$

where $s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)$ is the learned Score function.

DDIM: Deterministic Sampling

Denoising Diffusion Implicit Models (DDIM) (Song et al., 2021) converts DDPM's stochastic sampling process into a deterministic process, enabling fast sampling through ODE solving.

Key observation: The DDPM forward process can be viewed as a discretization of the VP SDE:

$$dx = -\frac{1}{2}\beta(t)\, x \, dt + \sqrt{\beta(t)} \, dW_t.$$

The corresponding Probability Flow ODE is:

$$\frac{dx}{dt} = -\frac{1}{2}\beta(t)\, x - \frac{1}{2}\beta(t)\, \nabla_x \log p_t(x).$$

DDIM sampling: Using the trained Score function, generate samples by numerically solving this ODE from $t = T$ back to $t = 0$.

Advantages:
- Deterministic: given the initial noise, the generated sample is deterministic
- Fast sampling: large step sizes can be used, reducing the number of sampling steps
- Reversibility: images can be precisely encoded into the latent space

Unified Continuous-Time Perspective

SDE form: Forward diffusion SDE:

$$dx = f(x, t) \, dt + g(t) \, dW_t.$$

Reverse SDE:

$$dx = \left[f(x, t) - g^2(t)\, \nabla_x \log p_t(x)\right] dt + g(t) \, d\bar{W}_t.$$

ODE form (Probability Flow ODE):

$$\frac{dx}{dt} = f(x, t) - \frac{1}{2} g^2(t)\, \nabla_x \log p_t(x).$$

Numerical solvers:
- Euler-Maruyama: explicit Euler method for SDEs
- Heun's method: second-order method for ODEs
- Runge-Kutta: higher-order ODE methods
- Predictor-corrector: hybrid methods combining SDE and ODE steps

Trade-off between sampling quality and speed:
- SDE sampling: better stochastic exploration but requires more steps
- ODE sampling: deterministic, can use large step sizes, but may lose fine details
- Hybrid methods: use SDE exploration in early stages and ODE refinement in later stages
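The probability flow ODE can be tested without any training by choosing a Gaussian data distribution, for which the score of every marginal is known analytically. A sketch under illustrative assumptions (VE SDE with $\sigma^2(t) = t$, data $\mathcal{N}(0, s_0^2)$, Heun integration): integrating the ODE backward from the prior recovers the data variance exactly.

```python
import numpy as np

rng = np.random.default_rng(4)

# VE forward SDE dx = dW (f = 0, g^2 = d sigma^2/dt = 1) with sigma^2(t) = t.
# For Gaussian data N(0, s0^2), the marginal is p_t = N(0, s0^2 + t), so the
# score is -x / (s0^2 + t) in closed form.
s0_sq, T = 0.25, 9.75  # chosen so that s0^2 + T = 10

def pf_ode_rhs(x, t):
    # dx/dt = f - 0.5 * g^2 * score = 0 - 0.5 * (-x / (s0_sq + t))
    return 0.5 * x / (s0_sq + t)

def heun_reverse(x, n_steps=500):
    """Integrate the probability flow ODE from t = T down to t = 0 (Heun)."""
    dt = -T / n_steps
    t = T
    for _ in range(n_steps):
        k1 = pf_ode_rhs(x, t)
        k2 = pf_ode_rhs(x + dt * k1, t + dt)
        x = x + 0.5 * dt * (k1 + k2)
        t += dt
    return x

x_T = rng.standard_normal(100_000) * np.sqrt(s0_sq + T)  # prior p_T = N(0, 10)
x_0 = heun_reverse(x_T)
print(x_0.var())  # ~ s0^2 = 0.25: the data variance is recovered
```

Because the ODE is deterministic, rerunning with the same `x_T` always produces the same `x_0`, which is the property DDIM exploits for encoding.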

Experiments: From Theory to Practice

Experiment 1: One-Dimensional Diffusion Process Visualization

We first visualize the diffusion process in one dimension to validate theoretical predictions.

Setup: The initial distribution is a bimodal mixture of two Gaussians. The forward process is pure diffusion:

$$dx = g \, dW_t,$$

with constant diffusion coefficient $g$.

Theoretical predictions:
- The probability density evolution is governed by the Fokker-Planck equation
- Analytical solutions can be computed by convolution with the Gaussian kernel
- As $t \to \infty$, the distribution tends to a unimodal Gaussian

Implementation: We use numerical methods to solve the Fokker-Planck equation and visualize the evolution of probability density. The code is provided in the accompanying Python file diffusion_pde_experiments.py.
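The accompanying file is not reproduced here, but the analytic side of the experiment fits in a few lines. A stand-in sketch with illustrative mixture parameters (means $\pm 2$, standard deviation $0.5$, $g = 1$; these are my assumptions, not values from the original): convolving each component with the heat kernel inflates its variance by $g^2 t$, and counting modes shows the bimodal-to-unimodal transition.

```python
import numpy as np

def mixture_density(x, t, g=1.0, means=(-2.0, 2.0), s0=0.5):
    """Exact solution of dx = g dW started from an equal-weight Gaussian
    mixture: each component's variance grows as s0^2 + g^2 * t
    (convolution with the heat kernel of variance g^2 * t)."""
    var = s0**2 + g**2 * t
    p = np.zeros_like(x)
    for m in means:
        p += 0.5 * np.exp(-(x - m)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p

def n_modes(p):
    """Count strict local maxima of a sampled density."""
    return int(np.sum((p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])))

x = np.linspace(-10, 10, 2001)
print(n_modes(mixture_density(x, t=0.0)))  # 2: still bimodal
print(n_modes(mixture_density(x, t=8.0)))  # 1: merged into a single mode
```

For an equal-weight mixture with means $\pm d$ and common standard deviation $\sigma$, the density is bimodal exactly when $d > \sigma$, so the merge happens once $s_0^2 + g^2 t > d^2$.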

Results analysis:
- The initial bimodal distribution gradually "diffuses" over time, with decreasing peak heights
- Beyond a critical time, the distribution becomes unimodal
- The final distribution approaches a Gaussian, validating the theoretical predictions

Experiment 2: Score Function Learning and Visualization

We learn the Score function of a simple two-dimensional distribution and visualize its gradient field.

Setup: The target distribution has a "double moon" shape:

$$p(x) \propto e^{-E(x)},$$

where $E(x)$ is an energy function defining the double-moon shape.

Network architecture: Use a simple MLP to learn the Score function.

Training: Use denoising Score Matching loss. See the code implementation for details.

Results analysis:
- The learned Score function closely matches the true Score function visually
- Score vectors point in the direction of increasing probability density
- In low-probability regions, Score vectors have larger magnitudes, pushing samples toward high-probability regions

Experiment 3: Comparison of Different SDE/ODE Samplers

We compare the effectiveness of different numerical methods for solving reverse diffusion SDEs/ODEs.

Setup: Use the Score network trained in Experiment 2, sampling from a standard Gaussian prior.

Methods compared:
1. Euler-Maruyama (SDE, first order)
2. Heun's method (ODE, second order)
3. Runge-Kutta 4 (ODE, fourth order)

Results analysis:
- Euler-Maruyama: retains stochasticity but requires more steps
- Heun/RK4: deterministic sampling of similar quality, with RK4 slightly better
- All methods generate reasonable samples, validating the effectiveness of the learned Score function

Experiment 4: PDE-Constrained Conditional Generation

We implement a simple PDE-constrained conditional generation task: given boundary conditions, generate samples satisfying the PDE.

Setup: Consider the Poisson equation:

$$-\Delta u = f \quad \text{in } \Omega,$$

with boundary condition $u|_{\partial\Omega} = g$.

Method: Use diffusion models to generate samples satisfying the PDE.

Results analysis:
- The conditional generation model produces samples consistent with the given conditions (such as boundary values)
- Generated samples satisfy the PDE constraints statistically
- This demonstrates the potential of diffusion models in scientific computing applications

Summary and Outlook

This article systematically establishes the PDE theoretical framework for diffusion models. Starting from classical heat equations, we demonstrated the mathematical essence of diffusion processes; introduced stochastic differential equations and the Fokker-Planck equation, revealing the laws of probability density evolution; focused on Score-Based generative models, establishing connections between Score function learning and Langevin dynamics sampling; delved into DDPM and DDIM, showing their essence as discretization schemes for SDEs/ODEs; and finally validated theoretical predictions through four complete experiments.

Key insights:
1. Diffusion models are PDE solvers: forward diffusion corresponds to the Fokker-Planck equation; reverse sampling corresponds to reverse SDEs or probability flow ODEs
2. The Score function is central: learning Score functions is equivalent to learning gradients of log-densities, avoiding computation of normalization constants
3. Choice of discretization scheme: SDE sampling explores better but is slower; ODE sampling is deterministic and faster; hybrid methods balance both
4. Conditional generation extension: PDE constraints can be naturally incorporated into the diffusion model framework, opening new paths for scientific computing applications

Future directions:
- More efficient sampling algorithms: fast sampling methods based on PDE theory
- Conditional generation theory: Score Matching theory under PDE constraints
- Multi-scale diffusion: combining multi-resolution PDE solving techniques
- Application expansion: physical simulation, inverse problem solving, scientific discovery

The PDE nature of diffusion models not only provides profound theoretical insights but also points the way for future algorithm design and application expansion. As PDE theory and deep learning further integrate, we can expect to see more breakthrough progress.

References

  1. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

  2. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851. arXiv:2006.11239

  3. Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.

  4. Anderson, B. D. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3), 313-326. DOI:10.1016/0304-4149(82)90051-5

  5. Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695-709.

  6. Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7), 1661-1674.

  7. Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32. arXiv:1907.05600

  8. Song, Y., & Ermon, S. (2020). Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33, 12438-12448. arXiv:2006.09011

  9. Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35, 26565-26577. arXiv:2206.00364

  10. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35, 5775-5787. arXiv:2206.00927

  11. Dockhorn, T., Vahdat, A., & Kreis, K. (2022). Score-based generative modeling with score-matching objectives. Advances in Neural Information Processing Systems, 35, 35289-35304.

  12. Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., & Ye, J. C. (2023). Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687.

  13. Song, Y., Durkan, C., Murray, I., & Ermon, S. (2021). Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34, 1415-1428. arXiv:2101.09258

  14. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684-10695. arXiv:2112.10752

  15. Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., ... & Norouzi, M. (2022). Palette: Image-to-image diffusion models. ACM SIGGRAPH 2022 Conference Proceedings. arXiv:2111.05826

  • Post title: PDE and Machine Learning (7): Diffusion Models and Score Matching
  • Post author: Chen Kai
  • Create time: 2022-03-05 09:30:00
  • Post link: https://www.chenk.top/pde-ml-7-diffusion-models/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.