Vignesh Gopakumar

Stochastic PDEs

Published: July 26, 2025


This post was written as a learning exercise through multiple interactions with Claude and Gemini. Though I have served as a human-in-the-loop for the mathematical definitions, it is possible that I have gotten some of them wrong as well.

The Physical Origins of Stochastic Dynamics: The Langevin Equation

The mathematical theory of stochastic processes, while abstract, has deep roots in the physical world. Its origins can be traced to the effort to understand and model systems where deterministic laws are insufficient to capture observed behavior. The quintessential example is Brownian motion, and the Langevin equation stands as the first and most intuitive physical model to bridge the gap between microscopic randomness and macroscopic, observable dynamics.

The Phenomenon of Brownian Motion

In 1827, botanist Robert Brown observed the incessant, erratic motion of pollen grains suspended in water. For decades, this phenomenon remained a curiosity until Albert Einstein, in his 1905 annus mirabilis, provided a quantitative theoretical explanation [1]. Einstein’s analysis shifted the focus from the impossible task of tracking a single particle’s trajectory to describing its statistical properties. He predicted that the average squared distance a particle travels from its origin, known as the Mean Squared Displacement (MSD), should grow linearly with time [1]. This relationship, expressed as \(\langle x(t)^2 \rangle = 2Dt\), where \(D\) is the diffusion coefficient, became the statistical hallmark of a diffusive process [1]. This macroscopic law demanded a microscopic explanation that could account for both the particle’s gradual slowing and its persistent, random jiggling.

A Physicist’s Model: Decomposing Forces

Several years after Einstein’s work, French physicist Paul Langevin offered an alternative, more direct approach rooted in classical mechanics [2]. He started with Newton’s second law, \(m\frac{d^2x}{dt^2} = F_{\text{total}}\), and proposed a crucial conceptual leap: the total force exerted by the fluid on the much larger Brownian particle could be decomposed into two distinct components [3].

A Systematic, Deterministic Drag Force

This force, \(F_{\text{drag}} = -\gamma v\), is proportional to the particle’s velocity \(v = \frac{dx}{dt}\). It represents the macroscopic effect of viscosity, a dissipative force that continuously removes kinetic energy from the particle and resists its motion through the fluid [2]. The constant \(\gamma\) is the friction coefficient.

A Rapidly Fluctuating, Stochastic Force

This force, denoted \(f(t)\), represents the net effect of the immense number of random, high-frequency collisions between the particle and the individual molecules of the surrounding fluid. While each collision is a deterministic event, their collective impact is modeled as a random process that injects energy into the particle, causing its erratic movement [1].

This decomposition is a foundational act of physical modeling. It separates the system’s degrees of freedom into a slow, macroscopic variable of interest (the particle’s velocity) and a vast number of fast, microscopic variables (the fluid molecules’ states) whose collective influence is treated statistically [4]. This separation of timescales is only valid because the particle is much larger and more massive than the fluid molecules, causing its velocity to change slowly compared to the timescale of individual molecular collisions [4].

This modeling philosophy, a form of “structured ignorance,” is remarkably modern. Instead of attempting to model an intractably complex system in full detail, it strategically coarse-grains the microscopic dynamics into a statistical term. This same principle underpins contemporary generative models in AI, where the goal is not to parameterize the entire complex manifold of data but to define a simple stochastic process that can navigate to and from it.

The Langevin Equation

By combining these two forces, Langevin formulated his eponymous equation, a first-order differential equation for the velocity \(v\) that contains a stochastic term:

\[m\frac{dv}{dt} = -\gamma v + f(t)\]

This equation is the archetype of a Stochastic Differential Equation (SDE) [5]. It describes the evolution of a system subject to both deterministic (dissipative) and fluctuating (random) forces.

Characterizing the Noise: The Fluctuation-Dissipation Theorem

To make the Langevin equation mathematically useful, the statistical properties of the random force \(f(t)\) must be specified. Based on physical reasoning, two key assumptions are made [5]:

  1. Zero Mean: The average of the random force over many realizations is zero, \(\langle f(t) \rangle = 0\). This reflects the fact that collisions are equally likely to push the particle in any direction.

  2. Time Uncorrelation: The forces at two different times, \(t_1\) and \(t_2\), are uncorrelated, expressed mathematically as \(\langle f(t_1)f(t_2) \rangle = g\delta(t_1 - t_2)\). The term \(\delta(t_1 - t_2)\) is the Dirac delta function, which is zero everywhere except when \(t_1 = t_2\). This idealization, known as white noise, implies that the memory of the collisions is infinitesimally short. This is a reasonable approximation on the timescale of the particle’s motion, which is much longer than the duration of a single molecular collision [4]. The constant \(g\) represents the strength of the noise.

A profound connection exists between the dissipative drag force and the fluctuating random force. Both originate from the same underlying microscopic interactions with the fluid molecules. For a system in thermal equilibrium at temperature \(T\), the rate of energy dissipation due to friction must be balanced, on average, by the rate of energy injection from random fluctuations [6]. This balance leads to the Fluctuation-Dissipation Theorem, a cornerstone of non-equilibrium statistical mechanics, which relates the noise strength \(g\) to the friction coefficient \(\gamma\) and the temperature \(T\) [6]:

\[g = 2\gamma k_B T\]

where \(k_B\) is the Boltzmann constant. This theorem is not just a mathematical convenience; it is a physical constraint that ensures the model is thermodynamically consistent and that the particle’s average kinetic energy will eventually settle to the value predicted by the equipartition theorem, \(\langle \frac{1}{2}mv^2 \rangle = \frac{1}{2}k_B T\) [6].

Overdamped Dynamics

In many physical and biological systems, particularly on long timescales or in highly viscous media, the particle’s acceleration is negligible. The inertial term \(m\frac{dv}{dt}\) becomes insignificant compared to the large frictional force \(-\gamma v\) [4]. In this overdamped limit, the Langevin equation simplifies significantly. If the particle is also subject to an external conservative force derived from a potential \(U(x)\), such that \(F_{\text{ext}} = -\nabla U(x)\), the equation becomes a first-order SDE for the position \(x\):

\[\gamma \frac{dx}{dt} = -\nabla U(x) + f(t)\]

This overdamped Langevin equation is a foundational model in chemical physics, molecular dynamics, and, as will be seen, serves as a direct inspiration for the dynamics used in modern generative models [5].
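
As a quick sanity check on this picture, the sketch below (a minimal NumPy example, assuming a naive forward-Euler discretization of the free overdamped equation with \(U = 0\), and using the Einstein relation \(D = k_B T / \gamma\)) simulates many independent particles and compares the empirical mean squared displacement against Einstein's prediction \(\langle x(t)^2 \rangle = 2Dt\). The parameter values are purely illustrative.

```python
import numpy as np

# Free overdamped Brownian motion, gamma dx/dt = f(t), discretized naively as
# x_{n+1} = x_n + sqrt(2 D dt) * z_n with D = kT / gamma (Einstein relation).
# We check Einstein's prediction <x(t)^2> = 2 D t. Parameters are illustrative.
rng = np.random.default_rng(0)
kT, gamma = 1.0, 2.0
D = kT / gamma                       # diffusion coefficient
dt, n_steps, n_paths = 1e-3, 2000, 5000

x = np.zeros(n_paths)
for n in range(n_steps):
    x += np.sqrt(2 * D * dt) * rng.standard_normal(n_paths)

t_final = n_steps * dt
print(f"empirical MSD : {np.mean(x**2):.4f}")
print(f"2 D t         : {2 * D * t_final:.4f}")
```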

The Mathematical Framework of Stochastic Differential Equations

The Langevin equation provides a powerful physical intuition, but its reliance on the ill-defined “white noise” process necessitates a more rigorous mathematical foundation. This foundation is provided by the theory of stochastic calculus, developed primarily by Kiyosi Itô in the 1940s [7]. This framework replaces the problematic concept of white noise with the well-defined Wiener process and develops a new form of calculus to handle integration with respect to it.

From Physical Noise to a Mathematical Object: The Wiener Process

The mathematical formalization of the integral of white noise is the Wiener process, denoted \(W(t)\), also known as mathematical Brownian motion [8]. It is a continuous-time stochastic process defined by three fundamental properties [9]:

  1. Initial Condition: \(W(0) = 0\) with probability one.
  2. Independent Increments: For any sequence of times \(0 \leq s < t < u < v\), the increments \(W(t) - W(s)\) and \(W(v) - W(u)\) are independent random variables.
  3. Gaussian Increments: The increment \(W(t) - W(s)\) is a Gaussian random variable with mean 0 and variance \(t - s\), i.e., \(W(t) - W(s) \sim \mathcal{N}(0, t - s)\).

From these properties, it can be shown that the sample paths of \(W(t)\) are continuous everywhere but differentiable nowhere [9]. This non-differentiability is the central reason why classical Riemann-Stieltjes integration and calculus are inadequate for dealing with stochastic processes driven by Brownian motion.
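
The defining properties translate directly into a sampling recipe: a discretized path is just a cumulative sum of independent Gaussian increments. A minimal NumPy sketch (step counts and seed are illustrative):

```python
import numpy as np

# Sampling Wiener process paths directly from the defining properties:
# W(0) = 0, plus independent Gaussian increments with variance dt.
rng = np.random.default_rng(1)
T, n_steps, n_paths = 1.0, 500, 5000
dt = T / n_steps

dW = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))   # increments ~ N(0, dt)
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)

# Sanity check: across paths, W(T) should have mean ~0 and variance ~T.
print("mean of W(T):", W[:, -1].mean())
print("var  of W(T):", W[:, -1].var(), "(target:", T, ")")
```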

The Failure of Classical Calculus and the Rise of Itô Calculus

An ordinary differential equation (ODE) like \(\frac{dx}{dt} = b(x(t))\) is formally equivalent to the integral equation \(x(t) = x(0) + \int_0^t b(x(s))ds\) [10]. A general SDE is written in a similar differential form:

\[dX(t) = b(X(t), t)dt + B(X(t), t)dW(t)\]

Here, \(b(X(t), t)\) is the drift coefficient, representing the deterministic evolution, and \(B(X(t), t)\) is the diffusion coefficient, modulating the magnitude of the random fluctuations [9]. The equation is properly interpreted as the integral equation:

\[X(t) = X(0) + \int_0^t b(X(s), s)ds + \int_0^t B(X(s), s)dW(s)\]

The first integral is a standard Riemann integral. The second, the stochastic integral, is problematic. Due to the unbounded variation of \(W(t)\), the value of a Riemann-Stieltjes-type sum depends on the choice of evaluation points within each subinterval.

Itô’s key contribution was to provide a consistent definition for this integral. The Itô integral is defined as the mean-square limit of Riemann sums in which the integrand is always evaluated at the left endpoint of each time subinterval [9]. This specific choice gives the resulting integral the crucial property of being a martingale: loosely, given the present, its expected future value equals its current value. This property is essential in many applications, particularly in mathematical finance.

An alternative definition, the Stratonovich integral, uses the midpoint of the subinterval [7]. The Stratonovich formulation has the advantage of obeying the standard chain rule of calculus, making it more intuitive for physicists who often view SDEs as the limit of physical processes with a small but non-zero noise correlation time [7]. For the important case of additive noise, where the diffusion coefficient \(B\) is not a function of the state \(X\), the Itô and Stratonovich interpretations are equivalent [7].
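
The dependence on the evaluation point can be demonstrated numerically for the classic integral \(\int_0^T W\,dW\): the left-endpoint sum converges to the Itô value \((W_T^2 - T)/2\), while the midpoint-style sum converges to the Stratonovich value \(W_T^2/2\). A small sketch, assuming NumPy and a single finely discretized path:

```python
import numpy as np

# The stochastic integral of W dW over [0, T], evaluated with a left-endpoint
# (Ito) sum and a midpoint (Stratonovich) sum on the same Brownian path.
# Closed forms: Ito gives (W_T^2 - T)/2, Stratonovich gives W_T^2 / 2.
rng = np.random.default_rng(2)
T, n_steps = 1.0, 200_000
dt = T / n_steps

dW = np.sqrt(dt) * rng.standard_normal(n_steps)
W = np.concatenate([[0.0], np.cumsum(dW)])            # W(t_0), ..., W(t_N)

ito = np.sum(W[:-1] * dW)                              # left endpoint
strat = np.sum(0.5 * (W[:-1] + W[1:]) * dW)            # midpoint-style

print("Ito sum          :", ito,   " vs (W_T^2 - T)/2 =", (W[-1]**2 - T) / 2)
print("Stratonovich sum :", strat, " vs  W_T^2 / 2    =", W[-1]**2 / 2)
```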

The Cornerstone of Stochastic Calculus: Itô’s Lemma

The most important operational tool in stochastic calculus is Itô’s Lemma, which is the stochastic analogue of the chain rule [9]. If \(Y(t) = u(t, X(t))\) is a function of time and the stochastic process \(X(t)\), its differential \(dY(t)\) is not what one would expect from classical calculus. It contains an additional second-order term that arises because the quadratic variation of the Wiener process is non-zero; informally, \((dW_t)^2\) is of the order \(dt\), whereas \((dt)^2\) and \(dt \cdot dW_t\) are of higher order and vanish in the limit [9].

For a process \(dX_t = b(t, X_t)dt + B(t, X_t)dW_t\), Itô’s Lemma states:

\[du(t, X_t) = \left(\frac{\partial u}{\partial t} + b\frac{\partial u}{\partial x} + \frac{1}{2}B^2\frac{\partial^2 u}{\partial x^2}\right)dt + B\frac{\partial u}{\partial x}dW_t\]

The presence of the second derivative term \(\frac{\partial^2 u}{\partial x^2}\) is the fundamental departure from ordinary calculus and has profound consequences. For example, it is used to solve the SDE for geometric Brownian motion, \(dP_t = \mu P_t dt + \sigma P_t dW_t\), a standard model for stock prices [9]. Applying Itô’s Lemma to \(u(P) = \log(P)\) yields the solution \(P_t = P_0 \exp((\mu - \frac{1}{2}\sigma^2)t + \sigma W_t)\), where the \(-\frac{1}{2}\sigma^2\) term, known as the “Itô correction,” arises directly from the second-derivative term in the lemma [11].

The Macroscopic View: The Fokker-Planck Equation

The SDE describes the evolution of individual sample paths or trajectories of a system. However, in many scientific contexts, the primary interest lies in the evolution of the probability distribution of the system’s state, \(P(x, t)\) [7]. The Fokker-Planck Equation (FPE) is a deterministic partial differential equation (PDE) that describes how this probability density function evolves over time [12, 13].

For an SDE given by \(dX_t = b(X_t)dt + B(X_t)dW_t\), the corresponding FPE for the probability density \(P(x, t)\) is:

\[\frac{\partial P(x, t)}{\partial t} = -\frac{\partial}{\partial x}[b(x)P(x, t)] + \frac{1}{2}\frac{\partial^2}{\partial x^2}[B^2(x)P(x, t)]\]

The FPE can be derived from the SDE using the Chapman-Kolmogorov equation, which describes the evolution of probabilities for a Markov process [12]. The drift term \(b(x)\) of the SDE gives rise to a convection (or drift) term in the FPE, while the diffusion term \(B(x)\) gives rise to a diffusion term. This equation provides a deterministic description for the evolution of the entire ensemble of possible trajectories. For physical systems in contact with a thermal bath, the stationary solution of the FPE (where \(\frac{\partial P}{\partial t} = 0\)) must correspond to the equilibrium Boltzmann distribution, \(P_{\text{eq}}(x) \propto \exp(-U(x)/k_B T)\), which provides a powerful consistency check on the model [4].
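
As an illustration of this consistency check, the sketch below (assuming NumPy, an Euler discretization of the overdamped Langevin equation, and a harmonic potential \(U(x) = \frac{1}{2}kx^2\)) runs many particles to equilibrium and compares the empirical stationary variance with the Boltzmann prediction \(k_B T / k\). All parameter values are illustrative.

```python
import numpy as np

# Stationary solution check: overdamped Langevin dynamics in a harmonic
# potential U(x) = 0.5 * k * x^2. The Boltzmann distribution predicts a
# Gaussian stationary density with variance kT / k. Parameters illustrative.
rng = np.random.default_rng(3)
kT, gamma, k = 1.0, 1.0, 4.0
dt, n_steps, n_paths = 1e-3, 20_000, 2000

x = np.zeros(n_paths)
for n in range(n_steps):
    drift = -(k / gamma) * x                                      # -U'(x)/gamma
    x += drift * dt + np.sqrt(2 * kT / gamma * dt) * rng.standard_normal(n_paths)

print("empirical stationary variance:", x.var())
print("Boltzmann prediction kT / k  :", kT / k)
```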

The relationship between the SDE and the FPE represents a powerful duality. The SDE offers a microscopic, Lagrangian, or particle-based perspective, describing the trajectory of a single realization. The FPE offers a macroscopic, Eulerian, or continuum perspective, describing the evolution of the probability density of the entire ensemble. This duality is not merely a mathematical convenience; it is the conceptual engine that drives the application of SDEs in generative modeling. Diffusion models, for instance, operate by defining a forward SDE that acts on individual data points (the path-wise view). This process is then used to learn a score function, a property of the evolving probability distribution. This learned distributional property is then used to define a reverse SDE that guides the generation of new samples, effectively shaping the entire output distribution. The SDE framework thus provides the tools to computationally bridge the path-wise and distributional descriptions of a complex system.

A Glimpse into Infinite Dimensions: Stochastic Partial Differential Equations (SPDEs)

The SDE framework can be extended to describe systems that have spatial extent, such as a vibrating string or a temperature field, where the state \(u(x, t)\) is a function of both space \(x\) and time \(t\). The resulting equations are known as Stochastic Partial Differential Equations (SPDEs) [14]. A canonical example is the stochastic heat equation, which models the temperature of a rod subject to random thermal fluctuations along its length [15].

SPDEs introduce significant new mathematical challenges. The driving noise is typically a “space-time white noise,” which is singular in both space and time. Consequently, solutions to SPDEs are often much “rougher” than their SDE counterparts. For instance, the solution to the one-dimensional stochastic heat equation is typically Hölder-continuous with an exponent of nearly 1/4 in time, compared to the nearly 1/2 Hölder continuity of SDE solutions [16]. While a rich and active area of research, the theory of SPDEs is considerably more complex and is a step beyond the foundational SDE framework required for the machine learning models discussed in this post.

Numerical Simulation of Stochastic Trajectories

With the exception of a few specific cases (primarily linear SDEs), most stochastic differential equations do not have analytical, closed-form solutions. Therefore, to study the behavior of systems described by SDEs, one must turn to numerical methods. These methods approximate the continuous-time solution with a discrete-time Markov chain that can be simulated on a computer [17].

The Need for Discretization

The core idea behind numerical SDE solvers is to discretize the time interval of interest, \([0, T]\), into a finite number of steps, \(0 = t_0 < t_1 < \cdots < t_N = T\), each of size \(\Delta t = t_{n+1} - t_n\). The algorithm then generates a sequence of values \(Y_0, Y_1, \ldots, Y_N\) where each \(Y_n\) is an approximation of the true solution \(X(t_n)\). The key challenge is to correctly approximate the stochastic increment of the Wiener process, \(\Delta W_n = W(t_{n+1}) - W(t_n)\). From the definition of the Wiener process, we know that these increments are independent and identically distributed Gaussian random variables with mean 0 and variance \(\Delta t\). Thus, in a simulation, we can generate them as \(\Delta W_n = \sqrt{\Delta t} Z_n\), where \(Z_n\) is a standard normal random variable, \(Z_n \sim \mathcal{N}(0, 1)\) [18].

The Euler-Maruyama Method

The Euler-Maruyama method is the most fundamental and widely used numerical scheme for SDEs. It is a direct extension of the forward Euler method for ODEs [19]. The derivation begins with the integral form of the SDE:

\[X(t_{n+1}) = X(t_n) + \int_{t_n}^{t_{n+1}} b(X(s), s)ds + \int_{t_n}^{t_{n+1}} B(X(s), s)dW(s)\]

The simplest approximation is to assume the integrands \(b\) and \(B\) are constant over the small interval \([t_n, t_{n+1}]\) and equal to their values at the start of the interval, \(t_n\). This yields the update rule [18]:

\[Y_{n+1} = Y_n + b(Y_n, t_n)\Delta t + B(Y_n, t_n)\Delta W_n\]

This scheme is simple to implement and forms the basis for many more advanced methods. It is the workhorse for simulating SDEs, especially in the context of diffusion models where it is often used to discretize the reverse-time SDE for sampling [20].
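
A minimal implementation is only a few lines. The sketch below wraps the update rule in a generic helper (the name `euler_maruyama` and all parameter choices are my own, illustrative ones) and applies it to the Ornstein-Uhlenbeck SDE \(dX = -\theta X\,dt + \sigma\,dW\), whose mean is known analytically to decay as \(x_0 e^{-\theta t}\):

```python
import numpy as np

# A minimal Euler-Maruyama solver, applied to the Ornstein-Uhlenbeck SDE
# dX = -theta * X dt + sigma dW, for which E[X_t] = x0 * exp(-theta * t).
def euler_maruyama(b, B, x0, T, n_steps, n_paths, rng):
    dt = T / n_steps
    x = np.full(n_paths, float(x0))
    for n in range(n_steps):
        t = n * dt
        dW = np.sqrt(dt) * rng.standard_normal(n_paths)   # Wiener increments
        x = x + b(x, t) * dt + B(x, t) * dW                # EM update rule
    return x

rng = np.random.default_rng(4)
theta, sigma, x0, T = 2.0, 0.5, 1.0, 1.0
X_T = euler_maruyama(lambda x, t: -theta * x,
                     lambda x, t: sigma * np.ones_like(x),
                     x0, T, n_steps=1000, n_paths=20_000, rng=rng)

print("empirical E[X_T]:", X_T.mean())
print("analytical value:", x0 * np.exp(-theta * T))
```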

The Milstein Method: A Higher-Order Approach

The Euler-Maruyama method’s accuracy can be poor, particularly for SDEs with multiplicative noise, where the diffusion coefficient \(B\) depends on the state \(X\). The Milstein method provides a higher-order correction by including an additional term from the Itô-Taylor expansion of the solution [21]. This expansion is analogous to a standard Taylor series but is derived using Itô’s Lemma. The correction term accounts for the interaction between the diffusion coefficient and the noise process itself. For a scalar SDE, the Milstein update rule is [18]:

\[Y_{n+1} = Y_n + b(Y_n)\Delta t + B(Y_n)\Delta W_n + \frac{1}{2}B(Y_n)B'(Y_n)((\Delta W_n)^2 - \Delta t)\]

where \(B'(Y_n)\) is the derivative of the diffusion coefficient with respect to the state. The additional term significantly improves the accuracy of the pathwise approximation. Notably, if the noise is additive (\(B\) is a constant), then \(B' = 0\), and the Milstein method reduces exactly to the Euler-Maruyama method [18]. While powerful, the Milstein scheme can be complex to implement for systems of SDEs, as it requires simulating additional stochastic integrals known as Lévy areas [22].
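
The sketch below applies both schemes to geometric Brownian motion, \(dP = \mu P\,dt + \sigma P\,dW\), where \(B(P) = \sigma P\) and \(B'(P) = \sigma\). Because the exact solution along a given Brownian path is known (see the Itô's Lemma section above), the pathwise error of each scheme can be measured directly; the parameter values are illustrative:

```python
import numpy as np

# Milstein vs Euler-Maruyama on geometric Brownian motion
# dP = mu*P dt + sigma*P dW, where B(P) = sigma*P and B'(P) = sigma.
# The exact solution on the same Brownian path allows a direct pathwise
# comparison of the two schemes at time T.
rng = np.random.default_rng(5)
mu, sigma, P0, T, n_steps = 0.1, 0.6, 1.0, 1.0, 500
dt = T / n_steps
dW = np.sqrt(dt) * rng.standard_normal(n_steps)

P_em, P_mil = P0, P0
for n in range(n_steps):
    P_em  += mu * P_em * dt + sigma * P_em * dW[n]
    P_mil += (mu * P_mil * dt + sigma * P_mil * dW[n]
              + 0.5 * sigma**2 * P_mil * (dW[n]**2 - dt))   # Milstein correction

P_exact = P0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * dW.sum())
print("Euler-Maruyama error:", abs(P_em - P_exact))
print("Milstein error      :", abs(P_mil - P_exact))
```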

Concepts of Convergence in a Stochastic World

Evaluating the accuracy of a numerical SDE solver is more nuanced than for ODEs. Because the solution is a random process, we cannot simply measure the error of a single trajectory. Instead, convergence is defined in a statistical sense, with two principal modes being most important:

Strong Convergence

This criterion measures the average error between the approximate and true trajectories at the final time \(T\). A scheme has a strong order of convergence \(\alpha\) if the mean absolute error is bounded by a constant times the step size to the power of \(\alpha\): \(\mathbb{E}[|X(T) - Y_N|] \leq C(\Delta t)^\alpha\). Strong convergence is important when the accuracy of individual sample paths is critical. The Euler-Maruyama method has a strong order of \(\alpha = 0.5\), while the Milstein method achieves a strong order of \(\alpha = 1.0\) [23].

Weak Convergence

This criterion measures the error in the expected value of functions of the solution. A scheme has a weak order of convergence \(\beta\) if, for a suitable class of test functions \(f\), the error in the expectation is bounded: \(|\mathbb{E}[f(X(T))] - \mathbb{E}[f(Y_N)]| \leq C(\Delta t)^\beta\). Weak convergence is sufficient for many applications where only statistical moments or ensemble averages are needed, such as in financial option pricing or statistical physics. Both the Euler-Maruyama and Milstein methods typically have a weak order of \(\beta = 1.0\) [23].

The fact that the strong convergence order for Euler-Maruyama is 0.5 is a direct consequence of the underlying stochastic calculus. The dominant error term in a single step comes from the stochastic part, \(B(Y_n)\Delta W_n\). Since \(\Delta W_n\) has a standard deviation of \(\sqrt{\Delta t}\), the error scales with \(\sqrt{\Delta t}\), which is slower than the \(\Delta t\) scaling of the drift term’s error [23]. The Milstein method achieves strong order 1.0 precisely because its correction term, \(\frac{1}{2}BB'((\Delta W_n)^2 - \Delta t)\), is specifically constructed to cancel this leading-order stochastic error term, demonstrating a deep link between the structure of Itô’s Lemma and the design of accurate numerical schemes [24].
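
These convergence orders can also be estimated empirically. The sketch below (assuming NumPy and geometric Brownian motion, whose exact pathwise solution is available) coarsens the same fine Brownian paths to several step sizes and fits the mean absolute endpoint error against \(\Delta t\) on a log-log scale; the fitted slopes should come out close to 0.5 for Euler-Maruyama and 1.0 for Milstein:

```python
import numpy as np

# Empirical strong-convergence check on geometric Brownian motion: the same
# fine Brownian paths are coarsened to several step sizes, and the mean
# absolute endpoint error of each scheme is fitted against dt on a log-log
# scale to estimate the strong order.
rng = np.random.default_rng(6)
mu, sigma, P0, T = 0.1, 0.8, 1.0, 1.0
n_fine, n_paths = 2**12, 1000
dt_fine = T / n_fine
dW_fine = np.sqrt(dt_fine) * rng.standard_normal((n_paths, n_fine))
P_exact = P0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * dW_fine.sum(axis=1))

levels = [2**6, 2**7, 2**8, 2**9, 2**10]
errs_em, errs_mil, dts = [], [], []
for n_steps in levels:
    dt = T / n_steps
    ratio = n_fine // n_steps
    dW = dW_fine.reshape(n_paths, n_steps, ratio).sum(axis=2)   # coarse increments
    P_em = np.full(n_paths, P0)
    P_mil = np.full(n_paths, P0)
    for n in range(n_steps):
        P_em = P_em + mu * P_em * dt + sigma * P_em * dW[:, n]
        P_mil = (P_mil + mu * P_mil * dt + sigma * P_mil * dW[:, n]
                 + 0.5 * sigma**2 * P_mil * (dW[:, n]**2 - dt))
    errs_em.append(np.mean(np.abs(P_em - P_exact)))
    errs_mil.append(np.mean(np.abs(P_mil - P_exact)))
    dts.append(dt)

slope_em = np.polyfit(np.log(dts), np.log(errs_em), 1)[0]
slope_mil = np.polyfit(np.log(dts), np.log(errs_mil), 1)[0]
print(f"estimated strong order, Euler-Maruyama: {slope_em:.2f}  (theory: 0.5)")
print(f"estimated strong order, Milstein      : {slope_mil:.2f}  (theory: 1.0)")
```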

The following table summarizes the key characteristics of these foundational numerical methods:

| Feature | Euler-Maruyama Method | Milstein Method |
|---|---|---|
| Update formula (scalar) | \(Y_{n+1} = Y_n + b(Y_n)\Delta t + B(Y_n)\Delta W_n\) | \(Y_{n+1} = Y_n + b(Y_n)\Delta t + B(Y_n)\Delta W_n + \frac{1}{2}B(Y_n)B'(Y_n)((\Delta W_n)^2 - \Delta t)\) |
| Strong order | \(O(\Delta t^{0.5})\) | \(O(\Delta t^{1.0})\) |
| Weak order | \(O(\Delta t^{1.0})\) | \(O(\Delta t^{1.0})\) |
| Complexity | Low. Requires evaluation of \(b\) and \(B\). | Higher. Requires evaluation of \(b\), \(B\), and the derivative \(B'\). Can be complex for multi-dimensional systems. |
| Primary use case | Prototyping, SDEs with additive noise, or when weak convergence is the sole concern. | SDEs with multiplicative noise where pathwise accuracy (strong convergence) is important. |
| Key insight | The simplest discretization, directly extending the deterministic Euler method. | Includes the first stochastic correction term from the Itô-Taylor expansion, significantly improving strong convergence for state-dependent noise. |

Stochastic Differential Equations as a Foundation for Machine Learning Models

The mathematical framework of SDEs, originally developed for physics, provides a powerful and increasingly central language for modern machine learning. The connection is most profound in the areas of probabilistic modeling and Bayesian inference, particularly through Gaussian Processes and state-space models, where SDEs offer both a generative perspective and significant computational advantages.

Gaussian Processes (GPs): A Primer

A Gaussian Process is a non-parametric model that defines a probability distribution over functions [25]. It is a generalization of the multivariate Gaussian distribution to an infinite-dimensional function space. A GP is completely specified by two functions [25]:

  • Mean Function \(m(x)\): The expected value of the function at input \(x\), \(m(x) = \mathbb{E}[f(x)]\).
  • Covariance Function (or Kernel) \(k(x, x')\): The covariance between the function values at inputs \(x\) and \(x'\), \(k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]\).

The kernel is the heart of a GP, as it encodes prior beliefs about the function’s properties. For example, a smooth kernel will lead to a distribution over smooth functions. A key property is that for any finite set of inputs \(\mathbf{X} = \{x_1, \ldots, x_n\}\), the corresponding function values \(\mathbf{f} = [f(x_1), \ldots, f(x_n)]^T\) are jointly Gaussian with mean \(\mathbf{m}\) and covariance matrix \(\mathbf{K}\), where \(K_{ij} = k(x_i, x_j)\) [26].

The Fundamental Link: Linear SDEs and GPs

A deep connection exists between GPs and linear SDEs: many commonly used GP kernels are precisely the covariance functions of the stationary solutions to linear time-invariant (LTI) SDEs driven by white noise [27].

The canonical example is the Ornstein-Uhlenbeck (OU) process, which describes the velocity of a Brownian particle. It is the solution to the SDE [28]:

\[\frac{df(t)}{dt} = -\lambda f(t) + w(t)\]

where \(w(t)\) is white noise. The stationary solution to this SDE is a Gaussian process with zero mean and an exponential covariance kernel:

\[k(\tau) = \sigma^2 \exp(-\lambda|\tau|)\]

where \(\tau = |t - t'|\). This process is both Gaussian and Markovian, meaning its future state depends only on its present state, not its entire history. This Markov property is a direct consequence of the SDE being first-order.
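
This correspondence is easy to verify numerically: simulating the OU SDE from its stationary distribution and estimating \(\mathrm{Cov}[f(0), f(\tau)]\) over many paths recovers the exponential kernel. A small sketch, with the noise strength chosen as \(2\lambda\sigma^2\) so that the stationary variance is \(\sigma^2\) (parameter values illustrative):

```python
import numpy as np

# Checking that the stationary OU process reproduces the exponential kernel
# k(tau) = sigma2 * exp(-lam * tau). We simulate
# dX = -lam * X dt + sqrt(2 * lam * sigma2) dW starting from the stationary
# distribution and estimate Cov[X(0), X(tau)] over many paths.
rng = np.random.default_rng(7)
lam, sigma2 = 1.5, 0.8
dt, n_paths = 1e-3, 20_000

x0 = np.sqrt(sigma2) * rng.standard_normal(n_paths)    # stationary start
x = x0.copy()
t = 0.0
for tau in [0.5, 1.0, 2.0]:
    while t < tau - 1e-12:
        x += -lam * x * dt + np.sqrt(2 * lam * sigma2 * dt) * rng.standard_normal(n_paths)
        t += dt
    emp = np.mean(x0 * x)                               # empirical covariance
    print(f"tau={tau:.1f}  empirical {emp:.3f}  kernel {sigma2 * np.exp(-lam * tau):.3f}")
```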

This relationship can be generalized: any GP with a rational spectral density (the Fourier transform of its covariance function) can be represented as the solution to an \(N\)th-order LTI SDE of the form [29]:

\[\frac{d^N f}{dt^N} + a_{N-1}\frac{d^{N-1} f}{dt^{N-1}} + \cdots + a_0 f = w(t)\]

This reframes the concept of a kernel. The standard GP view sees a kernel as a declarative measure of similarity between points. The SDE perspective provides a generative, causal mechanism: the kernel emerges as a statistical property of a dynamical system evolving through time. This shift is not merely philosophical; it is the key to unlocking massive computational efficiencies.

The Matérn Kernel and its SPDE Representation

The Matérn family of kernels is exceptionally popular in machine learning and spatial statistics because it includes a parameter, \(\nu\), that directly controls the smoothness (mean-square differentiability) of the resulting functions [30]. The Matérn covariance is given by:

\[k_\nu(d) = \frac{\sigma^2}{2^{\nu-1}\Gamma(\nu)}\left(\frac{\sqrt{2\nu} d}{\rho}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu} d}{\rho}\right)\]

where \(d = |x - x'|\), \(\rho\) is a length-scale parameter, \(\Gamma\) is the gamma function, and \(K_\nu\) is the modified Bessel function of the second kind [30]. As \(\nu \to \infty\), the Matérn kernel converges to the infinitely smooth squared exponential (RBF) kernel [31].
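
For reference, the kernel can be evaluated directly with SciPy's modified Bessel function of the second kind; the small helper below (the name `matern` and the default parameters are illustrative) fills in the limiting value \(\sigma^2\) at \(d = 0\), where \(K_\nu\) is singular:

```python
import numpy as np
from scipy.special import gamma as gamma_fn, kv

# Direct evaluation of the Matern covariance using SciPy's modified Bessel
# function of the second kind, K_nu. The d = 0 entries are set to the limiting
# value sigma2, since kv(nu, 0) diverges.
def matern(d, nu=1.5, rho=1.0, sigma2=1.0):
    d = np.asarray(d, dtype=float)
    k = np.full_like(d, sigma2)                 # limit value at d = 0
    nz = d > 0
    scaled = np.sqrt(2 * nu) * d[nz] / rho
    k[nz] = sigma2 * (2 ** (1 - nu) / gamma_fn(nu)) * scaled**nu * kv(nu, scaled)
    return k

print(matern([0.0, 0.1, 0.5, 1.0, 2.0], nu=1.5, rho=1.0))
```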

A landmark result by Whittle (1963) connects the Matérn kernel to an SPDE. A Gaussian process with a Matérn covariance function is the stationary solution to the SPDE [32]:

\[(\kappa^2 - \Delta)^{\alpha/2} f(x) = \mathcal{W}(x)\]

where \(\alpha = \nu + d/2\), \(\Delta\) is the Laplacian operator, and \(\mathcal{W}(x)\) is spatial white noise. This result provides a deep link between the smoothness parameter \(\nu\) of the kernel and the order of a differential operator.

Computational Advantage: The State-Space Representation

The true power of the SDE-GP connection comes from the ability to cast the model into a state-space representation, which enables highly efficient inference algorithms [33]. An \(N\)th-order SDE can be converted into a system of \(N\) first-order SDEs by defining a state vector that includes the function and its first \(N-1\) derivatives, \(\mathbf{x}(t) = [f(t), f'(t), \ldots, f^{(N-1)}(t)]^T\). This results in a linear vector SDE [26]:

\[d\mathbf{x}(t) = \mathbf{F}\mathbf{x}(t)dt + \mathbf{L}w(t)\]

The function value we care about is simply a linear observation of this state, \(y(t) = \mathbf{H}\mathbf{x}(t)\), where \(\mathbf{H} = [1, 0, \ldots, 0]\). This is a continuous-time linear dynamical system.

When this system is observed at a discrete set of times \(\{t_k\}\), its solution can be written as a discrete-time linear Gaussian state-space model [26]:

\[\begin{align} \mathbf{x}_k &= \mathbf{A}_{k-1} \mathbf{x}_{k-1} + \mathbf{q}_{k-1} \quad \text{(State Transition)} \\ y_k &= \mathbf{H} \mathbf{x}_k + r_k \quad \text{(Measurement)} \end{align}\]

where \(\mathbf{q}_{k-1}\) and \(r_k\) are Gaussian noise terms. This structure is precisely the form required by the Kalman filter and the Rauch-Tung-Striebel (RTS) smoother [34, 35]. These are recursive algorithms that process data sequentially. The Kalman filter propagates the mean and covariance of the state forward in time, and the RTS smoother refines these estimates by propagating information backward [26].

The computational complexity of these algorithms scales linearly with the number of data points, \(O(n)\). This is a dramatic improvement over standard GP regression, which requires constructing and inverting an \(n \times n\) covariance matrix, a process with \(O(n^3)\) complexity [36]. By reformulating the GP as the solution to an SDE, we impose a Markovian structure on a higher-dimensional state space, which is the key that enables this efficient, recursive inference.
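
To make this concrete, here is a minimal sketch of the forward pass for the simplest case, the exponential (OU, or Matérn-1/2) kernel \(k(\tau) = \sigma^2 e^{-\lambda|\tau|}\), where the state is scalar and the discrete transition and process noise have the closed forms \(A_k = e^{-\lambda \Delta t_k}\) and \(Q_k = \sigma^2(1 - A_k^2)\). The function name `kalman_filter_ou` and all hyperparameter values are illustrative; only the filtering pass is shown, and an RTS smoothing pass would be added to obtain the full GP posterior at every time point:

```python
import numpy as np

# O(n) GP regression sketch for the exponential (OU / Matern-1/2) kernel via
# its scalar state-space form: transition A = exp(-lam * dt), process noise
# Q = sigma2 * (1 - A^2), measurement y = x + r with r ~ N(0, noise_var).
def kalman_filter_ou(times, y, lam=1.0, sigma2=1.0, noise_var=0.1):
    m, P = 0.0, sigma2                   # stationary prior at the first point
    means, variances = [], []
    t_prev = times[0]
    for t_k, y_k in zip(times, y):
        A = np.exp(-lam * (t_k - t_prev))
        m, P = A * m, A * A * P + sigma2 * (1 - A * A)   # predict
        S = P + noise_var                                # innovation variance
        K = P / S                                        # Kalman gain
        m, P = m + K * (y_k - m), (1 - K) * P            # update
        means.append(m)
        variances.append(P)
        t_prev = t_k
    return np.array(means), np.array(variances)

# Tiny usage example on irregularly spaced observations (illustrative values).
t_obs = np.array([0.0, 0.3, 0.35, 1.2, 2.0])
y_obs = np.sin(t_obs) + 0.1 * np.random.default_rng(8).standard_normal(t_obs.size)
m, v = kalman_filter_ou(t_obs, y_obs, lam=1.0, sigma2=1.0, noise_var=0.01)
print(np.c_[t_obs, m, v])
```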

The Generative Revolution: SDEs in Modern AI

In recent years, the SDE framework has moved from a tool for analysis to the core engine of a new class of powerful generative models. This paradigm shift, largely driven by the work on score-based generative models, has unified previous approaches and unlocked new capabilities in generating high-fidelity data such as images, audio, and more.

Score-Based Generative Modeling with SDEs

This framework, developed by Song et al., provides a continuous-time generalization of discrete-time diffusion models like DDPM [20, 37]. The process consists of two main parts:

The Forward Process (Data to Noise)

A forward SDE is defined to gradually transform a complex data distribution, \(p_0(x)\), into a simple, known prior distribution, \(\pi(x)\) (e.g., a standard Gaussian), over a time interval \([0, T]\). This SDE is of the general form:

\[dx = f(x, t)dt + g(t)dw\]

The drift \(f(x, t)\) and diffusion \(g(t)\) coefficients are pre-specified (not learned) and chosen to ensure that the distribution of \(x(T)\) is close to \(\pi(x)\). This process effectively destroys the information in the original data, leaving only noise [20].
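
A concrete and widely used instance is the variance-preserving SDE of [20], \(dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dw\) with a linear schedule \(\beta(t)\). Because this SDE is linear, the perturbation kernel \(p_t(x \mid x_0)\) is Gaussian with a closed-form mean and variance, so noisy training samples can be drawn in one shot rather than by simulating the SDE forward. A small sketch (the schedule values are illustrative):

```python
import numpy as np

# Forward perturbation under the variance-preserving SDE of [20]:
# dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw, with a linear schedule
# beta(t) = beta_min + t * (beta_max - beta_min) on [0, 1]. The perturbation
# kernel is Gaussian, so x_t given x_0 can be sampled in closed form.
beta_min, beta_max = 0.1, 20.0

def perturb(x0, t, rng):
    # integral of beta(s) from 0 to t for the linear schedule
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    mean = x0 * np.exp(-0.5 * int_beta)
    std = np.sqrt(1.0 - np.exp(-int_beta))
    return mean + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(9)
x0 = rng.standard_normal((4, 2))        # stand-in "data" points
print(perturb(x0, t=1.0, rng=rng))      # near t = 1 this is close to N(0, I)
```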

The Reverse Process (Noise to Data)

The central insight is that this diffusion process can be reversed in time. A remarkable theorem from stochastic calculus states that the reverse trajectory is also the solution to an SDE, which runs from \(t = T\) back to \(t = 0\) [38]. The reverse SDE is given by:

\[dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)]dt + g(t)d\bar{w}\]

where \(d\bar{w}\) is a standard Wiener process running backward in time.

This reverse SDE provides a path to generate data from noise. However, it depends on a critical, unknown quantity: \(\nabla_x \log p_t(x)\), which is the score function of the perturbed data distribution \(p_t(x)\) at time \(t\) [38]. The score function points in the direction of increasing data density and contains all the information necessary to guide the noisy samples back toward the original data manifold.

The learning problem is thus reduced to estimating this time-dependent score function. A neural network, \(s_\theta(x, t)\), is trained to approximate the score using an objective called denoising score matching [39]. Once the network is trained, it is plugged into the reverse SDE, which can then be simulated using numerical solvers like the Euler-Maruyama method to generate new samples by starting with a draw from the prior \(x(T) \sim \pi(x)\) and integrating backward to \(t = 0\) [20].
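
The sampler itself is just Euler-Maruyama run backward in time. The sketch below uses the variance-preserving drift and diffusion from above and, in place of a trained network \(s_\theta(x, t)\), a toy analytic score: if the data distribution is already \(\mathcal{N}(0, I)\), then \(p_t\) remains \(\mathcal{N}(0, I)\) under the VP SDE and the true score is simply \(-x\), so the generated samples should again be approximately standard normal. Everything here is an illustrative stand-in, not a trained model:

```python
import numpy as np

# Euler-Maruyama sampler for the reverse-time SDE. A trained score network
# s_theta(x, t) would normally supply the score; here a toy analytic score is
# used as a stand-in (true score -x when the data distribution is N(0, I)).
beta_min, beta_max = 0.1, 20.0
beta = lambda t: beta_min + t * (beta_max - beta_min)
score = lambda x, t: -x                 # placeholder for a learned s_theta(x, t)

def reverse_sde_sample(n_samples, dim, n_steps=1000, rng=None):
    rng = rng or np.random.default_rng(10)
    dt = 1.0 / n_steps
    x = rng.standard_normal((n_samples, dim))          # x(T) ~ prior N(0, I)
    for i in range(n_steps, 0, -1):
        t = i * dt
        f = -0.5 * beta(t) * x                         # forward drift f(x, t)
        g2 = beta(t)                                   # g(t)^2
        drift = f - g2 * score(x, t)                   # reverse-time drift
        x = x - drift * dt + np.sqrt(g2 * dt) * rng.standard_normal(x.shape)
    return x

samples = reverse_sde_sample(20_000, 1)
print("sample mean:", samples.mean(), " sample std:", samples.std())  # ~0 and ~1
```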

The Probability Flow ODE

For any SDE, there exists a corresponding deterministic Ordinary Differential Equation (ODE), known as the probability flow ODE, whose trajectories transport the probability density in exactly the same way as the SDE [38]. For the generative process, this ODE is:

\[\frac{dx}{dt} = f(x, t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)\]

This deterministic formulation offers several key advantages [38]:

  • Efficient Sampling: It can be solved with highly efficient, adaptive-step-size numerical ODE solvers, often leading to faster and more accurate sample generation than simulating the SDE.
  • Exact Likelihood Computation: The ODE defines an invertible mapping between data and noise, allowing it to be treated as a continuous normalizing flow [40]. This enables the exact computation of the data log-likelihood, a feature often missing in other generative models like GANs.
  • Latent Representation: It provides a unique, deterministic encoding of any data point into a latent representation in the noise space, which can be used for data manipulation and analysis.

Continuous-Time VAEs with Latent SDEs

The SDE framework has also been integrated into Variational Autoencoders (VAEs) to create powerful generative models for sequential and time-series data [41]. In this architecture, the latent variable is not a static vector but a continuous-time trajectory governed by an SDE.

Generative Model (Prior)

The prior distribution over the latent path \(z(t)\) is defined by a neural SDE: \(dz = f_\theta(z, t)dt + g_\theta(z, t)dw\). A decoder network then maps the latent path \(z(t)\) to the observed data \(x(t)\) [42].

Inference Model (Posterior)

An approximate posterior distribution over latent paths, \(q_\phi(z(t)|x)\), is defined by a second neural SDE, whose drift and diffusion coefficients can be conditioned on the observed data sequence \(x\) [43].

Training

The model is trained by maximizing a continuous-time version of the Evidence Lower Bound (ELBO). This objective includes a reconstruction term and a KL divergence term between the path measures of the prior and posterior SDEs [41].

This latent SDE framework is particularly well-suited for modeling irregularly-sampled and sparse time-series data, as the latent dynamics are defined continuously, allowing for principled inference and generation at any point in time [44].

Across these advanced models, the score function emerges as a unifying concept. In statistical physics, it is related to the forces acting on particles in a potential field. In score-based generative models, it is the learned vector field that reverses diffusion. In variational inference, minimizing the KL divergence is intrinsically related to matching the score of the approximate posterior to that of the true posterior. The success of these modern generative methods is a testament to the power of combining the dynamical scaffolding provided by SDEs with the expressive power of deep neural networks to learn the fundamental, data-dependent score function.

Conclusions

The theory of stochastic differential equations provides a remarkably versatile and powerful mathematical language that bridges disparate scientific domains. Originating from the need to describe physical phenomena like Brownian motion, the Langevin equation established a paradigm for modeling complex systems by separating deterministic, dissipative forces from stochastic, fluctuating ones. This physical intuition was formalized by the rigorous mathematics of Itô calculus, which introduced the Wiener process and Itô’s Lemma to handle the non-standard properties of stochastic dynamics.

The resulting framework offers a dual perspective: the path-wise view of the SDE, which describes individual trajectories, and the distributional view of the Fokker-Planck equation, which governs the evolution of the entire ensemble probability.

This duality, combined with the development of robust numerical methods like the Euler-Maruyama and Milstein schemes, has laid the groundwork for the application of SDEs in machine learning. The profound connection between linear SDEs and Gaussian Processes allows for a generative, causal interpretation of covariance kernels, leading to highly efficient state-space models and inference algorithms like the Kalman filter.

Most recently, this framework has catalyzed a revolution in generative AI. Score-based models leverage the time-reversal properties of SDEs to transform noise into data, reducing the complex task of generative modeling to the more tractable problem of learning the score function of the data distribution as it is progressively perturbed by noise. This approach, along with its deterministic counterpart, the probability flow ODE, has achieved state-of-the-art results in data synthesis and offers new capabilities for likelihood estimation and controllable generation.

Furthermore, the integration of SDEs into the latent space of VAEs has created a new class of continuous-time models adept at handling complex, irregularly-sampled time-series data. From the random walk of a pollen grain to the generation of photorealistic images, the mathematical principles of stochastic differential equations provide a unifying thread, demonstrating their enduring relevance and power as a foundational tool for modeling the uncertain world.


References

1. Einstein A (1905) Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Annalen der Physik 17(8):549–560
2. Langevin P (1908) Sur la théorie du mouvement brownien. Comptes Rendus de l’Académie des Sciences 146:530–533
3. Langevin P (1908) Sur la théorie du mouvement brownien. Comptes Rendus de l’Académie des Sciences 146:530–533
4. Gardiner C (2009) Stochastic methods: A handbook for the natural and social sciences. Springer Science & Business Media
5. Øksendal B (2003) Stochastic differential equations: An introduction with applications. Springer Science & Business Media
6. Kubo R (1966) Fluctuation-dissipation theorem. Reports on Progress in Physics 29(1):255
7. Itô K (1944) Stochastic integral. Proceedings of the Imperial Academy 20(8):519–524
8. Wiener N (1923) Differential space. Journal of Mathematics and Physics 2(1-4):131–174
9. Karatzas I, Shreve S (1991) Brownian motion and stochastic calculus. Springer Science & Business Media
10. Protter PE (2005) Stochastic integration and differential equations. Springer Science & Business Media
11. Merton RC (1973) Theory of rational option pricing. The Bell Journal of Economics and Management Science 141–183
12. Fokker AD (1914) Die mittlere Energie rotierender elektrischer Dipole im Strahlungsfeld. Annalen der Physik 43(5):810–820
13. Planck M (1917) Über einen Satz der statistischen Dynamik und seine Erweiterung in der Quantentheorie. Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin 324–341
14. Da Prato G, Zabczyk J (2014) Stochastic equations in infinite dimensions. Cambridge University Press
15. Bertini L, Cancrini N (1995) The stochastic heat equation: Feynman-Kac formula and intermittence. Journal of Statistical Physics 78(5-6):1377–1401
16. Hairer M, Mattingly JC (2008) Regularization by noise and stochastic Burgers equations. Communications in Mathematical Physics 282(1):1–30
17. Kloeden PE, Platen E (1992) Numerical solution of stochastic differential equations. Springer Science & Business Media
18. Higham DJ (2001) An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM
19. Maruyama G (1955) Approximate integration of stochastic differential equations. Rendiconti del Circolo Matematico di Palermo 4(1):48–50
20. Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B (2020) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456
21. Milstein GN (1975) Approximate integration of stochastic differential equations. Theory of Probability & Its Applications 19(3):557–562
22. Itô K (1951) Multiple Wiener-Itô integrals. Journal of the Mathematical Society of Japan 3(1):157–169
23. Talay D, Tubaro L (1990) Convergence of numerical schemes for stochastic differential equations. Stochastic Analysis and Applications 8(1):94–123
24. Milstein G (1986) Weak approximation of solutions of systems of stochastic differential equations. Theory of Probability & Its Applications 30(4):750–766
25. Rasmussen CE, Williams CK (2006) Gaussian processes for machine learning. MIT Press
26. Särkkä S (2013) Bayesian filtering and smoothing. Cambridge University Press
27. Tronarp F, Kersting H, Särkkä S, Hennig P (2019) Stochastic differential equations as Gaussian processes. Proceedings of the 36th International Conference on Machine Learning 6353–6363
28. Uhlenbeck GE, Ornstein LS (1930) On the theory of the Brownian motion. Physical Review 36(5):823
29. Cockayne J, Oates CJ, Sullivan TJ, Girolami M (2019) Linear operators and stochastic partial differential equations as Gaussian processes. Journal of Machine Learning Research 20(1):1990–2058
30. Stein ML (1999) Matérn covariance functions. Interpolation of spatial data: Some theory for kriging 31–48
31. Duvenaud D (2014) The squared exponential kernel. Automatic model construction with Gaussian processes 9–16
32. Whittle P (1963) Stochastic processes in several dimensions. Bulletin of the International Statistical Institute 40(2):974–994
33. Hartikainen J, Särkkä S (2010) Kalman filtering and smoothing solutions to temporal Gaussian process regression models. Proceedings of the 2010 IEEE International Workshop on Machine Learning for Signal Processing 379–384
34. Kalman RE (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82(1):35–45
35. Rauch HE, Tung F, Striebel CT (1965) Maximum likelihood estimates of linear dynamic systems. AIAA Journal 3(8):1445–1450
36. Lindgren F, Rue H, Lindström J (2011) Efficient Gaussian process inference for short-scale spatial variation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2):183–204
37. Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33:6840–6851
38. Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B (2021) Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations
39. Hyvärinen A (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(4)
40. Papamakarios G, Nalisnick E, Rezende DJ, Mohamed S, Lakshminarayanan B (2021) Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22(57):1–64
41. Li X, Wong T-KL, Chen RTQ, Duvenaud D (2020) Latent stochastic differential equations: Learning the generative model from data. International Conference on Learning Representations
42. Tzen B, Raginsky M (2019) Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883
43. Solin A, Särkkä S (2014) Scalable variational Gaussian processes via harmonic kernel decomposition. International Conference on Machine Learning 3088–3096
44. Rubanova Y, Chen RTQ, Duvenaud DK (2019) Latent ordinary differential equations for irregularly-sampled time series. Advances in Neural Information Processing Systems 32

© Copyright 2025 Vignesh Gopakumar