Simulating financial market dynamics

A. Lesniewski and G. Trigila
Baruch College
One Bernard Baruch Way
New York, NY 10010
USA
December 3, 2024
Abstract
We propose a highly efficient and accurate methodology for generating synthetic financial
market data using a diffusion model approach. The synthetic data produced by our methodology
align closely with observed market data in several key aspects: (i) they pass the two-sample
Cramer-von Mises test for portfolios of assets, and (ii) Q-Q plots demonstrate consistency
across quantiles, including in the tails, between observed and generated market data. Moreover,
the covariance matrices derived from a large set of synthetic market data exhibit significantly
lower condition numbers compared to the estimated covariance matrices of the observed data.
This property makes them suitable for use as regularized versions of the latter. For model training,
we develop an efficient and fast algorithm based on numerical integration rather than Monte Carlo
simulations. The methodology is tested on a large set of equity data.
1 Introduction
In this paper, we present an efficient methodology for generating synthetic financial market data,
based on the diffusion model approach. Diffusion models [19], [20], [6], [21], [22], a class of deep
generative models, are mathematical models designed to generate synthetic data by Monte Carlo simulation of a reverse-time stochastic process, specified as an Ito stochastic differential equation (diffusion process). The diffusion model approach to synthetic data generation is a two-stage process, consisting of encoding and decoding, and it employs linear stochastic differential equations.
These models have demonstrated impressive results across various applications, including computer vision, natural language processing, time series modeling, multimodal learning, waveform signal processing, robust learning, molecular graph modeling, materials design, and inverse problem solving [25]. Despite their successes, certain aspects of diffusion models, particularly those related
to the learning mechanism, require further refinement and development. Ongoing research efforts
focus on addressing these performance-related challenges and enhancing the overall capabilities of
diffusion modeling methodologies.
The significance of synthetic data across various domains, including addressing challenges related
to data privacy, regulatory compliance, and fraud detection, is highlighted in [10] and the references
therein. In the context of the finance industry, the ability to generate high-quality synthetic market
data plays a crucial role in applications such as portfolio allocation, portfolio and enterprise risk
quantification, the design and back-testing of trading and market-making strategies, what-if analysis,
and more.
In financial practice, accurately estimating various statistics of asset returns is critical. A key
example is the covariance matrix, which plays a central role in quantitative methodologies such as portfolio asset allocation and value at risk. These methodologies often assume, either explicitly or implicitly, that returns are drawn from a multidimensional Gaussian distribution or, at a minimum, an elliptical distribution. Various estimation techniques have been proposed, including: (i) standard sample
(or maximum likelihood) estimators, (ii) shrinkage estimators [15], (iii) factor-based estimators [4],
and others [3]. However, standard sample estimators often produce poorly conditioned covariance
matrices with unstable inverses. This issue is commonly attributed to two primary factors:
(i) First, the dimensionality of the problem poses challenges: a covariance matrix of dimension d
has d(d+1)/2 parameters, which are estimated from a time series of n observations. For typical
portfolio sizes, this leads to a severely underdetermined problem. The alternative methods listed
above address this issue by regularizing the sample estimator using exogenous mechanisms.
(ii) Second, it is often believed that the estimation process itself is somehow prone to “error maximization”, rendering it unreliable for practical use.
A natural approach to regularizing the estimated covariance matrix is to reliably generate a large
number of samples from the unknown, non-parametric probability distribution underlying the asset
returns. This would allow for the estimation of statistics, including the covariance matrix, with improved conditioning properties.
Specifically, we consider a portfolio of d assets A^1, . . . , A^d (e.g. equities), whose prices are observed at discrete times. By X^1, . . . , X^d we denote the returns on these assets. While the frequency of price observations is irrelevant to our analysis, for clarity, we assume daily observations. Denote the d-dimensional vectors of their returns at each time t_i, i = 1, . . . , n, where t_1 < t_2 < . . . < t_n, by X = {x_i, i = 1, . . . , n}. Notably, we impose no specific assumptions on the distribution of these
returns. The proposed method of generating synthetic data consistent with X consists in the following
steps:
1. We view the market data X as a sample drawn from an unspecified probability distribution
P0 . To each data point in X random noise is added incrementally over a time interval until the
data is completely deformed into white (Gaussian) noise. This random noise is modeled as a
suitable diffusion process.
2. The noising process is modeled as a linear stochastic differential equation, referred to in the sequel as the denoising SDE (DSDE), whose coefficient functions are to be determined.

3. Next, we estimate these coefficient functions using the score matching method, which estimates the gradient of the log probability of the data. This method teaches the model to move along the score in the direction that most rapidly increases the probability of the data. Score
matching involves minimizing a nonlinear functional expressed as a d-dimensional integral.
The standard approach to this optimization problem used in the literature (Monte Carlo simulation) is inefficient due to the high dimensionality of the problem. Instead, we develop a rapid, efficient method by reducing the dimensionality of the problem to a manageable size, where numerical quadrature can be carried out. This completes the encoding part of the algorithm.
4. Once we determine the score function of the DSDE, we set up the corresponding reverse-time
SDE. This reverse-time SDE generates the same sample paths as the original (forward-time)
SDE starting at the terminal time.
5. Since the terminal probability distribution is white noise, we easily generate samples from that distribution. Using these samples as terminal t = 1 values, we generate samples from the reverse-time SDE using the Euler-Maruyama numerical scheme. The t = 0 values of these samples represent the desired synthetic market data. Importantly, there is no limit to the number of synthetic data points we can generate. This completes the decoding part of the algorithm.
6. To assess whether the market data of size n and the generated synthetic sample of size m come
from the same (unknown) probability distribution P0, we perform a two-sample Cramer-von Mises test at the desired significance level. (A schematic sketch of the full pipeline follows this list.)
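In outline, and purely schematically, the pipeline can be sketched as follows. All function names in this sketch (train_score_network, reverse_sde_sample, cvm_two_sample_pvalue) are hypothetical placeholders for the components developed in Sections 5 and 6, not an actual implementation:

```python
import numpy as np

# Schematic sketch of steps 1-6; the helper functions are placeholders.
def generate_synthetic_market(X, m):
    """X: (n, d) array of observed returns; returns (m, d) synthetic returns."""
    theta = train_score_network(X)                    # encoding: score matching (Section 5)
    Z1 = np.random.standard_normal((m, X.shape[1]))   # white noise at t = 1
    X_hat = reverse_sde_sample(Z1, theta)             # decoding: reverse-time SDE (Section 6)
    p_value = cvm_two_sample_pvalue(X, X_hat)         # two-sample CvM test
    return X_hat, p_value
```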
Earlier work on this topic includes several notable publications. In [14], a market generator
based on a restricted Boltzmann machine (RBM) was introduced. However, tests using observed
market data revealed significant deviations in the tails of the RBM-generated distributions compared
to the actual data. Subsequently, a generative adversarial network (GAN) methodology for generating
market scenarios was developed in [24], demonstrating improved performance in the tails, aligning
more closely with market data. The diffusion model generative methodology proposed in this work
further enhances tail consistency, producing synthetic market scenarios that align well with observed
market data.
From a mathematical perspective, the style of this paper is informal, avoiding formally stated
theorems and proofs. Standard notations are used throughout, with the following conventions for
vector operations:
1. We employ Dirac’s notation to express various linear algebra operations, such as the inner
product of two vectors ⟨x|y⟩, the projection operator |x⟩⟨x| on a vector, etc.
2. For two vectors x, y ∈ R^d, x = (x^1, . . . , x^d), y = (y^1, . . . , y^d), their product xy is understood as the Hadamard (componentwise) product, i.e. (xy)^i = x^i y^i, for i = 1, . . . , d. Additionally, for a vector x ∈ R^d and a function f : R → R, f(x) represents the vector whose components result from applying f componentwise, i.e. f(x)^i = f(x^i). Similarly, inequalities like f(x) > 0 indicate that all components of f(x) are positive.
The paper is organized as follows. Section 2 reviews the methodology for model learning through
score matching. For completeness, Section 3 summarizes key results from the theory of stochastic
differential equations, which are applied later in the paper. Section 4 introduces the framework of
denoising stochastic differential equations. Section 5 constitutes the core technical contribution, presenting an efficient algorithm for model learning via score matching. This method eliminates the need for time-consuming Monte Carlo simulations, replacing them with precise and efficient analytical approximations. In Section 6, we define the concept of synthetic market scenarios generated through
diffusion modeling. Finally, Section 7 showcases a series of experiments using equity market data,
validating the effectiveness of the proposed methodology.
2 Score matching

The score of a probability distribution with a differentiable density p(x) is defined as s(x) = ∇ log p(x). For the Gaussian density n(x; µ, C), with mean µ and covariance C, the score is s(x) = −C^{−1}(x − µ), an expression that does not involve the determinant of C. The definition of the score can easily be extended to a family of probability distributions p(t, x) parameterized by time t.
The probability distribution p(x) typically depends on a set of parameters θ, i.e. p(x) = p(x; θ).
The objective of model training is to learn these parameters from the available data. In the context
of the Gaussian distribution, this dependence is established by specifying the functional relationship
between the mean and covariance in terms of the parameters, i.e. µ = µ(θ), C = C(θ).
Model learning via score matching consists in minimizing the objective function
$$
L(\theta) = \frac{1}{2}\, \mathbb{E}\, \| s(\,\cdot\,; \theta) - s(\,\cdot\,) \|^2 = \frac{1}{2} \int \| s(x; \theta) - s(x) \|^2\, p(x)\, dx, \tag{3}
$$
where p(x) represents here the continuum of data, and s(x; θ) is the score of the parameterized PDF³.
Integrating by parts, one can rewrite this expression in the form [7]
$$
L(\theta) = \int \Big( \nabla^T s(x; \theta) + \frac{1}{2}\, \| s(x; \theta) \|^2 \Big)\, p(x)\, dx + \mathrm{const}. \tag{5}
$$
³Note that the score in the sense used here is different from the Fisher score, defined as ∇_θ log p(x; θ).
This form of the score matching principle is particularly beneficial when used with Monte Carlo simulations. We will see below that this approach is equally applicable to numerical quadrature methods. The arguments x can be viewed as “corruptions” of the observations in the data set X, obtained by perturbing the data with noise drawn from a suitable probability distribution. The problem of optimizing (13) can then be viewed as the problem of denoising these corrupted observations.
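As a simple illustration of how this form lends itself to sampling-based evaluation, the following minimal NumPy sketch estimates the objective (5) at a set of sample points, approximating the divergence ∇ᵀs by central finite differences (the function ism_loss and its arguments are illustrative, not part of the methodology developed below):

```python
import numpy as np

def ism_loss(score, X, eps=1e-4):
    """Sample estimate of (5): (1/n) sum_i [div s(x_i) + 0.5 ||s(x_i)||^2].

    score: maps a (d,) array to a (d,) array; X: (n, d) array of samples.
    The divergence is approximated by central finite differences."""
    n, d = X.shape
    total = 0.0
    for x in X:
        div = 0.0
        for k in range(d):
            e = np.zeros(d)
            e[k] = eps
            div += (score(x + e)[k] - score(x - e)[k]) / (2.0 * eps)
        total += div + 0.5 * np.dot(score(x), score(x))
    return total / n
```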
3 Diffusion models
The incorporation of noise into the system is achieved by generating sample paths of a suitable diffusion (Ito) process, specified by a denoising stochastic differential equation. This method facilitates the simulation of noise effects within the model.
where
$$
D(t, x) = \sigma(t, x)\, \sigma(t, x)^T \tag{18}
$$
is the diffusion matrix.
Note that the probability density function p(t, x) of the process Xt is the solution to the forward
Kolmogorov equation:
$$
\frac{\partial}{\partial t}\, p(t, x) = -\nabla_x^T \big( \alpha(t, x)\, p(t, x) \big) + \frac{1}{2}\, \nabla_x^T \nabla_x \big( D(t, x)\, p(t, x) \big), \qquad p(0, x) = \delta(x - x_0). \tag{19}
$$
In the context of diffusion models, when linear SDEs are employed, the probability density function
p(t, x) indeed exists and is, in fact, Gaussian.
$$
\mu(t, X_0) = \mathbb{E}(X_t) = x_0\, e^{\int_0^t \alpha_1(s)\, ds} + \int_0^t \alpha_0(u)\, e^{\int_u^t \alpha_1(s)\, ds}\, du, \tag{23}
$$
where δ_ij denotes Kronecker’s delta. Hence the probability law of X_t is Gaussian, with density p(t, x) = n(x; µ(t, x_0), C(t)), where n(x; µ, C) is the PDF of the d-dimensional Gaussian distribution N(µ, C).
More generally, it follows from (22) that, when conditioned on Xu = y (where u ≤ t), we have
$$
X_t = y\, e^{\int_u^t \alpha_1(s)\, ds} + \int_u^t \alpha_0(v)\, e^{\int_v^t \alpha_1(s)\, ds}\, dv + \int_u^t \sigma(v)\, e^{\int_v^t \alpha_1(s)\, ds}\, dW_v. \tag{26}
$$
and covariance
$$
C(t|u)^{ij} = \mathrm{Cov}(X_t | X_u)^{ij} = \delta_{ij} \int_u^t \sigma_{ii}(v)^2\, e^{2 \int_v^t \alpha_1^i(s)\, ds}\, dv. \tag{29}
$$
The multivariate SDE above is, in fact, a system of univariate SDEs. If the initial condition were
drawn from a product of independent univariate distributions, studying its solution would involve
solving d independent one-dimensional SDEs. However, in our case, the initial condition X0 is
drawn from an unknown, possibly complex distribution. As a result, we consider the entire system
(22) as an aggregate rather than a collection of individual univariate problems.
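For illustration, a minimal Euler-Maruyama sketch of this forward system, treating drift and diffusion componentwise (Hadamard products) in line with our conventions; the coefficient functions α0, α1, σ are assumed given as functions of time returning (d,)-arrays:

```python
import numpy as np

def euler_maruyama_forward(x0, alpha0, alpha1, sigma, K=100):
    """Simulate the componentwise linear SDE
        dX_t = (alpha0(t) + alpha1(t) * X_t) dt + sigma(t) dW_t
    on [0, 1] with K equal steps, starting from the (d,) array x0."""
    dt = 1.0 / K
    x = np.array(x0, dtype=float)
    for j in range(K):
        t = j * dt
        dW = np.sqrt(dt) * np.random.standard_normal(x.shape)
        x = x + (alpha0(t) + alpha1(t) * x) * dt + sigma(t) * dW
    return x
```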
where
$$
\overline{\alpha}(t, X_t) = \alpha(t, X_t) - \frac{1}{p(t, X_t)}\, \nabla \big( D(t, X_t)\, p(t, X_t) \big). \tag{32}
$$
Equation (31) is the reverse-time SDE corresponding to (14). This reverse-time diffusion process
mirrors the original forward-time process while running in reverse time.
It can be verified that the process $\overline{W}_t$ exhibits the natural properties of a Brownian motion running in reverse time, and is referred to as the reverse-time Brownian motion. The presence of drift in $\overline{W}_t$ is, of course, related to the fact that it is associated with a different measure than the one governing $W_t$. Specifically, it is associated with the measure obtained from the original measure via Girsanov’s theorem after the shift (32) of the drift term in (14).
In particular, if σ(t, x) = σ(t) is independent of the state variable, then the formulas stated above
can be expressed in terms of the score function s(t, x) of the probability distribution p(t, x) in a more
straightforward manner. Namely,
In the case of a linear SDE of the form (20), the corresponding reverse-time diffusion process remains linear. This is a consequence of the explicit expression (2) for the true score function, according to which s(t, X_t) is linear in X_t. However, it should be emphasized that the reverse-time diffusion process used for scenario generation involves the estimated score function s(t, X_t; θ), where θ denotes a collection of trainable parameters. This function is trained (e.g., using neural networks) on the data X, and it is only approximately linear.
In the following sections, we will focus on the VP SDE. However, for completeness, we also provide
brief discussions of the sub-VP and VE SDEs.
$$
X_t = x_0 \exp\Big( -\frac{1}{2} \int_0^t \beta(s)\, ds \Big) + \int_0^t \sqrt{\beta(v)}\, \exp\Big( -\frac{1}{2} \int_v^t \beta(s)\, ds \Big)\, dW_v. \tag{38}
$$
Thanks to the specific form of the drift and diffusion coefficients, we can express this equation more
conveniently. Specifically, we introduce a new time variable:
$$
\tau(t) = \int_0^t \beta(s)\, ds. \tag{39}
$$
$$
Z_u = X_{\tau^{-1}(u)}. \tag{40}
$$
In other words, Zt represents the pullback of Xt under the time change defined by (39). Leveraging
the change of time formula for Brownian motion [17],
$$
\sqrt{\tau'(t)}\, dW_t = dW_{\tau(t)}, \tag{41}
$$
which asymptotically tends to the standard normal distribution, as t → 1, i.e. Xt becomes white
noise in this limit.
Explicitly, with the choice (36), the new time variables are given by
$$
\tau_i(t) = \begin{cases} -b_i \log(1 - t), & \text{if } a = 0, \\[6pt] \dfrac{b_i}{a} \Big( \dfrac{1}{(1-t)^a} - 1 \Big), & \text{if } a > 0. \end{cases} \tag{47}
$$
The parameter a determines the rate at which τ_i(t) approaches infinity as t ↑ 1, with a = 0 corresponding to the slowest rate.
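The formula (47) transcribes directly into code; a minimal sketch (np.log1p is used for accuracy at small t):

```python
import numpy as np

def tau(t, a, b):
    """Time change (47); t in [0, 1), a >= 0 a scalar, b a (d,) array of b_i."""
    t = np.asarray(t, dtype=float)
    b = np.asarray(b, dtype=float)
    if a == 0:
        return -b * np.log1p(-t)                      # -b_i log(1 - t)
    return (b / a) * ((1.0 - t) ** (-a) - 1.0)        # (b_i/a)((1-t)^{-a} - 1)
```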
$$
dX_t = -\frac{1}{2}\, \beta(t)\, X_t\, dt + \sqrt{\beta(t) \big( 1 - e^{-2 \int_0^t \beta(s)\, ds} \big)}\, dW_t. \tag{50}
$$
Rather than explicitly solving this cumbersome equation, we can directly invoke the time change
method.
Using the notation introduced earlier, and applying the change of time transformation (41), we
express this SDE in the form:
$$
dZ_\tau = -\frac{1}{2}\, Z_\tau\, d\tau + \sqrt{1 - e^{-2\tau}}\, dW_\tau. \tag{51}
$$
The solution for Zτ is given by:
$$
Z_\tau = z_0\, e^{-\tau/2} + \int_0^\tau \sqrt{1 - e^{-2s}}\, e^{-(\tau - s)/2}\, dW_s. \tag{52}
$$
Consequently, X_t tends to pure white noise as t → 1, albeit with a lower variance compared to that of the VP SDE.
$$
dX_t = -\beta(t) \Big( \frac{1}{2}\, X_t + \big( 1 - e^{-2 \int_0^t \beta(s)\, ds} \big)\, s(t, X_t) \Big)\, dt + \sqrt{\beta(t) \big( 1 - e^{-2 \int_0^t \beta(s)\, ds} \big)}\, d\overline{W}_t, \tag{55}
$$
where s(t, Xt ) is the (estimated) score function. After time change (39) this equation takes the form:
$$
dZ_\tau = -\Big( \frac{1}{2}\, Z_\tau + \big( 1 - e^{-2\tau} \big)\, s\big( \tau^{-1}(\tau), Z_\tau \big) \Big)\, d\tau + d\overline{W}_\tau. \tag{56}
$$
Explicitly,
$$
dZ_\tau = dW_\tau, \tag{63}
$$
with Z_0 = z_0 (= x_0), and
$$
\tau(t) = b\, a^t. \tag{64}
$$
Clearly, the probability density function of (58) is explicitly given by
$$
L^{\mathrm{SM}}(\theta) = \frac{1}{2} \int_0^1 \int_{\mathbb{R}^d} \lambda(t)\, \| s(t, x; \theta) - \nabla \log p(t, x) \|^2\, p(t, x)\, dx\, dt, \tag{68}
$$
where λ(t) > 0 is a suitably chosen weight function. Its purpose is to regularize the behavior of ∥s(t, x; θ) − ∇ log p(t, x)∥² at small values of t. As explained in Section 2.2, this objective function is equivalent to the DSM objective function [23]:
$$
L^{\mathrm{DSM}}(\theta) = \frac{1}{2} \int_0^1 \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} \lambda(t)\, \| s(t, x; \theta) - \nabla \log p(t, x | 0, x_0) \|^2\, p(t, x | 0, x_0)\, p_0(x_0)\, dx_0\, dx\, dt, \tag{69}
$$
where p(t, x|0, x0 ) = p(t, x) is the transition probability density (46), and p0 (x0 ) is the (empirical)
distribution of the data set X (7).
where i = 1, . . . , d, and where a(u) is a smooth, positive activation function, such as the softplus function,
$$
a(u) = \log(1 + e^u). \tag{73}
$$
The neuronal weights and biases θ = (w, b, c, d) have the following properties:
(i) w ∈ Mat_{h,d}(R), where h is the number of neurons in the hidden layer,
(ii) b ∈ R^h,
(iii) c ∈ Mat_{d,h}(R), and
(iv) d ∈ R^d.
Notice that we use upper and lower indices in accordance with our convention to label the components of a single x with upper indices, while examples of x are labeled with lower indices. We refer to this network as the noise conditional score network (NCSN).
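Assuming, as the expansion (80) below suggests, that the network takes the one-hidden-layer form K(x; θ) = c a(wx + b) + d, a minimal NumPy sketch of the NCSN reads:

```python
import numpy as np

def softplus(u):
    """Activation (73): a(u) = log(1 + e^u)."""
    return np.log1p(np.exp(u))

def K(x, w, b, c, d):
    """One-hidden-layer network: w is (h, d_assets), b is (h,),
    c is (d_assets, h), d is (d_assets,); x is a (d_assets,) array."""
    return c @ softplus(w @ x + b) + d
```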
Using (27) and (70) to compute the transition probability, we can rewrite this expression in the
form
$$
\begin{aligned}
L^{\mathrm{DSM}}(\theta) &= \frac{1}{2} \int_0^1 \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} \lambda(t)\, \Big\| s(t, x; \theta) + \frac{x - \mu(t, x_0)}{C(t)} \Big\|^2\, p(t, x | 0, x_0)\, p_0(x_0)\, dx_0\, dx\, dt \\
&= \frac{1}{2n} \sum_{i=1}^n \int_0^1 \int_{\mathbb{R}^d} \lambda(t)\, \Big\| \frac{K(x; \theta) + x_i - \mu(t, x_i)}{C(t)} \Big\|^2\, f_t^i(x)\, dx\, dt,
\end{aligned} \tag{74}
$$
where we have introduced the notation
$$
f_t^i(x) = n\big( x; \mu(t, x_i), C(t) \big). \tag{75}
$$
We will now choose λ(t) to have the form:
$$
\lambda(t) = \lambda_0(t)\, C(t)^2, \tag{76}
$$
where λ_0(t) is continuous for all t ≥ 0, and so
$$
L^{\mathrm{DSM}}(\theta) = \frac{1}{2n} \sum_{i=1}^n \int_0^1 \int_{\mathbb{R}^d} \lambda_0(t)\, \| K(x; \theta) + x_i - \mu(t, x_i) \|^2\, f_t^i(x)\, dx\, dt. \tag{77}
$$
In the case of the VP and sub-VP denoising diffusions, this expression reads explicitly
$$
L^{\mathrm{DSM}}(\theta) = \frac{1}{2n} \sum_{i=1}^n \int_0^1 \int_{\mathbb{R}^d} \lambda_0(t)\, \| K(x; \theta) + x_i - x_i e^{-\tau(t)} \|^2\, f_t^i(x)\, dx\, dt, \tag{78}
$$
with the appropriate covariances defined in Sections 4.1 and 4.2, respectively. The analogous expression for the VE denoising SDE reads
$$
L^{\mathrm{DSM}}(\theta) = \frac{1}{2n} \sum_{i=1}^n \int_0^1 \int_{\mathbb{R}^d} \lambda_0(t)\, \| K(x; \theta) + x_i \|^2\, f_t^i(x)\, dx\, dt, \tag{79}
$$
with the covariance defined in Section 4.3.
where f_t^i(x) is given by (75), and where w_j, w_k ∈ R^d denote the j-th and k-th rows of the matrix w, respectively. Notice that, with probability one, w_j and w_k are linearly independent for j ≠ k. In order to organize the computation, we rewrite this expression as follows:
$$
\begin{aligned}
L^{\mathrm{DSM}}(\theta) = \;& \frac{1}{2n} \sum_{i=1}^n \sum_{k=1}^d \int_0^1 \lambda_0(t)\, \big( x_i^k - \mu(t, x_i)^k + d^k \big)^2\, dt \\
&+ \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^h \sum_{k=1}^d c^{kj} \int_0^1 \lambda_0(t)\, \big( x_i^k - \mu(t, x_i)^k + d^k \big)\, I_1^i(w_j, b_j, t)\, dt \\
&+ \frac{1}{2n} \sum_{i=1}^n \sum_{j,k=1}^h \Big( \sum_{l=1}^d c^{lj} c^{lk} \Big) \int_0^1 \lambda_0(t)\, I_2^i(w_j, w_k, b_j, b_k, t)\, dt,
\end{aligned} \tag{80}
$$
where
$$
I_1^i(w, b, t) = \int_{\mathbb{R}^d} a\big( \langle w | x \rangle + b \big)\, f_t^i(x)\, dx, \tag{81}
$$
and
$$
I_2^i(w_1, w_2, b_1, b_2, t) = \int_{\mathbb{R}^d} a\big( \langle w_1 | x \rangle + b_1 \big)\, a\big( \langle w_2 | x \rangle + b_2 \big)\, f_t^i(x)\, dx, \tag{82}
$$
for linearly independent w_1 and w_2. Changing variables, we express I_1 and I_2 as integrals with respect to the standard d-dimensional Gaussian measure:
$$
I_1^i(w, b, t) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} a\big( \langle w | \sqrt{C(t)}\, x + \mu(t, x_i) \rangle + b \big)\, e^{-\frac{1}{2} \langle x | x \rangle}\, dx, \tag{83}
$$
and
$$
I_2^i(w_1, w_2, b_1, b_2, t) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} a\big( \langle w_1 | \sqrt{C(t)}\, x + \mu(t, x_i) \rangle + b_1 \big)\, a\big( \langle w_2 | \sqrt{C(t)}\, x + \mu(t, x_i) \rangle + b_2 \big)\, e^{-\frac{1}{2} \langle x | x \rangle}\, dx, \tag{84}
$$
respectively.
The d-dimensional integrals I_1 and I_2 are in fact highly numerically tractable and can be reduced to one- and two-dimensional Gaussian integrals, respectively. In order to see this, we set
$$
E = \frac{1}{\|w\|}\, w, \tag{85}
$$
as det(Γ) = 1. This integral can be efficiently evaluated using the Gauss-Hermite quadrature algorithm,
$$
I_1^i(w, b, t) \approx \sum_{p=1}^D w_p\, a\big( \|w\| \sqrt{2 C(t)}\, z_p + \langle w | \mu(t, x_i) \rangle + b \big). \tag{88}
$$
Here z_p, p = 1, . . . , D, are the zeros of the Hermite polynomial of degree D, and w_p are the appropriate weights. Even low values of D, such as D = 4, are sufficient to calculate the integral accurately.
Note the presence of √2 above, arising from the weight function in the integral being e^{−y²/2} rather than e^{−y²}.
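A minimal NumPy sketch of (88), with the softplus activation (73); numpy.polynomial.hermite.hermgauss returns nodes and weights for the weight function e^{−z²}, so the 1/√π normalization, which the text absorbs into the weights w_p, is kept explicit here:

```python
import numpy as np

def I1(w, b, mu, C, D=4):
    """Gauss-Hermite approximation (88) of the integral (83); C = C(t) is
    treated as a scalar, mu = mu(t, x_i) is a (d,) array."""
    z, wq = np.polynomial.hermite.hermgauss(D)   # zeros and weights, degree D
    args = np.linalg.norm(w) * np.sqrt(2.0 * C) * z + np.dot(w, mu) + b
    return np.sum(wq * np.log1p(np.exp(args))) / np.sqrt(np.pi)
```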
Similarly, let us define the normalized vectors
$$
E_i = \frac{1}{\|w_i\|}\, w_i, \tag{89}
$$
for i = 1, 2. We use the Gram-Schmidt algorithm to orthonormalize these vectors, which yields
$$
E_1' = E_1, \qquad E_2' = \frac{1}{(1 - \langle E_2 | E_1 \rangle^2)^{1/2}}\, \big( E_2 - \langle E_2 | E_1 \rangle E_1 \big). \tag{90}
$$
Let y_i = ⟨E_i′|x⟩, for i = 1, 2, and Γ = I − |E_1′⟩⟨E_1′| − |E_2′⟩⟨E_2′|. We also let
$$
c(w_1, w_2) = \frac{\langle w_1 | w_2 \rangle}{\|w_1\|\, \|w_2\|}, \tag{91}
$$
and
$$
s(w_1, w_2) = \sqrt{1 - \frac{\langle w_1 | w_2 \rangle^2}{\|w_1\|^2\, \|w_2\|^2}} \tag{92}
$$
denote the cosine and (the absolute value of the) sine of the angle between w_1 and w_2, respectively.
Then
$$
\begin{aligned}
I_2^i(w_1, w_2) &= \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} a\big( \|w_1\| \sqrt{C(t)}\, y_1 + \langle w_1 | \mu(t, x_i) \rangle + b_1 \big) \\
&\qquad \times a\big( \|w_2\| \sqrt{C(t)}\, \big( s(w_1, w_2)\, y_2 + c(w_1, w_2)\, y_1 \big) + \langle w_2 | \mu(t, x_i) \rangle + b_2 \big)\, e^{-\frac{1}{2}(y_1^2 + y_2^2) - \frac{1}{2} \langle x | \Gamma x \rangle}\, d^2 y\, d^{d-2} x \\
&= \frac{1}{2\pi} \int_{\mathbb{R}^2} a\big( \|w_1\| \sqrt{C(t)}\, y_1 + \langle w_1 | \mu(t, x_i) \rangle + b_1 \big) \\
&\qquad \times a\big( \|w_2\| \sqrt{C(t)}\, \big( s(w_1, w_2)\, y_2 + c(w_1, w_2)\, y_1 \big) + \langle w_2 | \mu(t, x_i) \rangle + b_2 \big)\, e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, d^2 y,
\end{aligned}
$$
as det(Γ) = 1. This two-dimensional Gaussian integral can be efficiently evaluated using the Gauss-Hermite quadrature algorithm,
$$
I_2^i(w_1, w_2) \approx \sum_{1 \le p, q \le D} w_p w_q\, a\big( \|w_1\| \sqrt{2 C(t)}\, z_p + \langle w_1 | \mu(t, x_i) \rangle + b_1 \big)\, a\big( \|w_2\| \sqrt{2 C(t)}\, \big( s(w_1, w_2)\, z_q + c(w_1, w_2)\, z_p \big) + \langle w_2 | \mu(t, x_i) \rangle + b_2 \big),
$$
with a low value of D.
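Analogously, a minimal sketch of the two-dimensional quadrature for I2, with the cosine (91) and sine (92) computed directly; again the 1/π normalization is kept explicit rather than absorbed into the weights:

```python
import numpy as np

def I2(w1, w2, b1, b2, mu, C, D=4):
    """Two-dimensional Gauss-Hermite approximation of (84); C = C(t) scalar."""
    a = lambda u: np.log1p(np.exp(u))                 # softplus (73)
    z, wq = np.polynomial.hermite.hermgauss(D)
    n1, n2 = np.linalg.norm(w1), np.linalg.norm(w2)
    cos = np.dot(w1, w2) / (n1 * n2)                  # c(w1, w2), (91)
    sin = np.sqrt(1.0 - cos**2)                       # s(w1, w2), (92)
    s = np.sqrt(2.0 * C)
    total = 0.0
    for p in range(D):
        for q in range(D):
            u1 = n1 * s * z[p] + np.dot(w1, mu) + b1
            u2 = n2 * s * (sin * z[q] + cos * z[p]) + np.dot(w2, mu) + b2
            total += wq[p] * wq[q] * a(u1) * a(u2)
    return total / np.pi
```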
Finally, as explained at the beginning of this section, we perform time integration using Simpson’s method. As a result of these calculations, we obtain an expression for (80) that is suitable for implementation in TensorFlow, enabling fast and accurate computations.
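A minimal sketch of the time integration with SciPy's composite Simpson rule, on the 8 subintervals used in the experiments of Section 7 (the integrand g below is a placeholder):

```python
import numpy as np
from scipy.integrate import simpson

t = np.linspace(0.0, 1.0, 9)     # 8 subintervals -> 9 Simpson nodes
g = lambda u: np.exp(-u)         # placeholder time integrand
value = simpson(g(t), x=t)       # approximately 1 - exp(-1)
```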
6 Synthetic markets
The process of generating synthetic markets involves several tasks:
(i) Training of the fitted score function, as discussed in Section 5.
(ii) Encoding and decoding of the training (historically observed) data.
(iii) Testing whether the generated data follow the same probability distribution as the training data.
This step is crucial, as financial data are often difficult to visualize.
In this section we focus on tasks (ii) and (iii).
Reverse-time path generation. To simulate sample paths of the reverse-time SDE (35) starting at
Xterm , an appropriate score function s(t, x) is required. In practice, we utilize the fitted score function
(71), defining ŝ(t, x) = s(t, x; θ).
As with forward path generation, we divide the interval [0, 1] into K + 1 equal subintervals of
length δ = 1/K, denoting the division points as tj = j/K, where j = 0, . . . , K. Let X̂tj represent
the (approximate) value of Xt at t = tj . The procedure is as follows:
Note: The Brownian motion used for reverse-time path generation should be freshly simulated and
must not reuse the Brownian motion from forward path generation. Reusing it would merely recover
the initial point, without the advantage of introducing noise.
A. Train the fitted score function (71) on the dataset X, as explained in Section 5.

B. Draw m independent white-noise samples as terminal t = 1 values and propagate each backward to t = 0 along the reverse-time SDE via the Euler-Maruyama scheme, with freshly simulated Brownian increments; denote the resulting t = 0 values by x̂_1, . . . , x̂_m (see the sketch below).

C. The set X̂ = {x̂_1, . . . , x̂_m} is the desired set of m synthetic market scenarios.
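A minimal sketch of steps B-C for the VP DSDE, whose reverse-time drift is −β(t)(X_t/2 + s(t, X_t)) (the VP analogue of (55)); score_hat stands for the fitted score (71), assumed vectorized over scenarios, and beta for the instantaneous variance (36):

```python
import numpy as np

def decode(score_hat, beta, m, d, K=1000):
    """Euler-Maruyama for the reverse-time VP SDE, integrating from t = 1
    down to t = 0 with step delta = 1/K and freshly simulated Brownian
    increments; returns the (m, d) array of synthetic scenarios."""
    delta = 1.0 / K
    x = np.random.standard_normal((m, d))             # white noise at t = 1
    for j in range(K, 0, -1):
        t = j * delta
        drift = -beta(t) * (0.5 * x + score_hat(t, x))
        dW = np.sqrt(delta) * np.random.standard_normal((m, d))
        x = x - drift * delta + np.sqrt(beta(t)) * dW
    return x
```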
It remains to verify that X̂ is drawn from the same distribution as the original market data X .
Multivariate tests for assessing the goodness of fit between two empirical distributions - see, for
example, [12] - are notoriously challenging and difficult to implement. In a forthcoming paper [16],
we propose a practical methodology that we believe is well-suited for situations like the one addressed
in this study.
In the meantime, we adapt the univariate two-sample Cramer-von Mises (CvM) criterion [2] to our context. To this end, we consider a specified portfolio of assets A^1, . . . , A^d by choosing their weights g = (g_1, . . . , g_d), where g_i ≥ 0, for all i, and Σ_{i=1}^d g_i = 1. Let
$$
p_i = \langle g | x_i \rangle, \quad i = 1, \dots, n, \qquad \hat{q}_k = \langle g | \hat{x}_k \rangle, \quad k = 1, \dots, m, \tag{97}
$$
denote the returns on this portfolio under the historical and synthetic scenarios, respectively. Let
P = {p1 , . . . , pn } and Q̂ = {q̂1 , . . . , q̂m } denote the two datasets of portfolio returns.
The two-sample CvM criterion tests the hypothesis that the two samples P and Q̂ are drawn from the same distribution. This is done using the test statistic T_CvM [2], whose value depends, of course, on the choice of weights g. The null hypothesis, which posits that the two samples come from the same distribution (i.e. T_CvM = 0), can be rejected at a desired level of confidence.
While no choice of the portfolio weights g is a priori preferred, numerical simulations [16] suggest that an equally weighted portfolio g = (1/d, . . . , 1/d) often lies near the maximizer (i.e. the worst case scenario) of T_CvM. We use the corresponding p-value of the CvM test, denoted in the following by p_CvM, as a provisional metric for the significance level of the multivariate test assessing whether the datasets X and X̂ are drawn from the same distribution.
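The test itself is available in SciPy as scipy.stats.cramervonmises_2samp, so the portfolio-level check based on (97) can be sketched as follows:

```python
import numpy as np
from scipy.stats import cramervonmises_2samp

def portfolio_cvm_pvalue(X, X_hat, g=None):
    """p-value of the two-sample CvM test for the portfolio returns (97);
    defaults to the equally weighted portfolio g = (1/d, ..., 1/d)."""
    d = X.shape[1]
    g = np.full(d, 1.0 / d) if g is None else np.asarray(g)
    p = X @ g          # historical portfolio returns p_i
    q = X_hat @ g      # synthetic portfolio returns q_k
    return cramervonmises_2samp(p, q).pvalue
```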
7 Experiments
For our experiments, we downloaded daily closing prices from Bloomberg⁵ for 33 large U.S. equities over the five-year period from November 13, 2019, to November 13, 2024.⁶ This period notably encompasses the economic distress caused by the COVID-19 pandemic, which introduced significant volatility to equity markets. Testing the model’s ability to accurately simulate under these non-stationary conditions serves as a critical assessment of its validity.
We selected various data windows, each n = 256 days in length - approximately equivalent
to the number of trading days in a year. For each window, we trained the diffusion model on the
corresponding data and generated m = 1, 024 synthetic market scenarios.
The model was configured as follows: the VP DSDE was selected with instantaneous volatility
functions defined by (36), using parameters a = 0 and bi = 0.1 for all i. Gauss-Hermite integration
was performed with an order of D = 4, and Simpson integration was performed using 8 subintervals.
The function λ0 (t) in (76) was set to λ0 (t) = 1, and the number of hidden neurons was configured as
h = 16. Training was conducted using the Adam optimizer.
To evaluate the model’s performance for each run, we recorded the following metrics:
(i) The p-value pCvM of the CvM statistic of the equally weighted portfolio of all selected assets.
⁵The Bloomberg closing prices are adjusted for stock splits but not for dividends. We believe this does not impact our simulation methodology.
⁶AAPL, AXP, BA, CAT, COST, CSCO, CVX, DD, DIS, GE, GS, HD, IBM, INTC, JNJ, JPM, KO, LLY, MCD, MMM, MRK, MSFT, NKE, NVDA, PFE, PG, TRV, TSLA, UNH, V, VZ, WMT, XOM.
(ii) The condition number κ_synth of the covariance matrix of the synthetic market data, compared to that of the historical data, κ_hist (a computational sketch follows this list).
(iii) The Q-Q plot comparing the synthetic data to the historical market data for visual assessment.
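Metric (ii) can be computed along the following lines (a minimal sketch):

```python
import numpy as np

def covariance_condition_numbers(X, X_hat):
    """Condition numbers of the sample covariance matrices of the historical
    data X (n, d) and of the synthetic data X_hat (m, d)."""
    kappa_hist = np.linalg.cond(np.cov(X, rowvar=False))
    kappa_synth = np.linalg.cond(np.cov(X_hat, rowvar=False))
    return kappa_hist, kappa_synth
```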
Experiment 1. We use days 1 through 256 of the data set. Historically, this period covered the COVID-19 pandemic, highly elevated market volatility, and aggressive rate cuts by the Federal Reserve.
$$
p_{\mathrm{CvM}} = 1.00, \quad \kappa_{\mathrm{hist}} = 109.64, \quad \kappa_{\mathrm{synth}} = 53.46. \tag{98}
$$
Experiment 2. We use days 250 through 506 of the data set. This period was marked by historically low interest rates.
$$
p_{\mathrm{CvM}} = 0.64, \quad \kappa_{\mathrm{hist}} = 130.37, \quad \kappa_{\mathrm{synth}} = 70.63. \tag{99}
$$
Experiment 3. We use days 500 through 756 of the data set. This period was marked by rising
inflation, elevated market volatility, and interest rate hikes implemented by the Federal Reserve.
$$
p_{\mathrm{CvM}} = 1.00, \quad \kappa_{\mathrm{hist}} = 208.17, \quad \kappa_{\mathrm{synth}} = 124.40. \tag{100}
$$
Experiment 4. We use days 750 through 1006 of the data set. During this period, the Federal Reserve
concluded its rate-hiking program.
$$
p_{\mathrm{CvM}} = 0.83, \quad \kappa_{\mathrm{hist}} = 95.82, \quad \kappa_{\mathrm{synth}} = 51.05. \tag{101}
$$
Experiment 5. We use days 1000 through 1256 of the data set. This period was characterized by
occasional spikes in market volatility driven by the presidential election.
$$
p_{\mathrm{CvM}} = 0.98, \quad \kappa_{\mathrm{hist}} = 438.51, \quad \kappa_{\mathrm{synth}} = 282.80. \tag{102}
$$
Figure 10: Q-Q plot of the historical versus synthetic returns in Experiment 5.
Experiment 6. In our final experiment we use days 1 through 1248 of the data set, which covers the entire five-year period.⁷ In order to accommodate a larger data set, for this experiment we increased the number of hidden neurons to h = 32.
$$
p_{\mathrm{CvM}} = 0.99, \quad \kappa_{\mathrm{hist}} = 130.56, \quad \kappa_{\mathrm{synth}} = 85.03. \tag{103}
$$
⁷We use 1248 (rather than 1258) for convenience, as 1248 is divisible by 32, our batch size parameter.
Figure 11: Histograms of historical (left) and synthetic (right) returns in Experiment 6.
Figure 12: Q-Q plot of the historical versus synthetic returns in Experiment 6.
It is worth noting that the period used for our experiments was characterized by rapidly shifting
economic regimes, including market volatility from the pandemic, falling interest rates, subsequent
inflation, rising rates, and political uncertainty surrounding the presidential election. Despite these
challenges, the model consistently demonstrated strong performance across the different time windows.
The model can generate an arbitrary number of synthetic data points without compromising their
tail behavior or the CvM p-value. Moreover, the condition number of the estimated covariance matrix
decreases as the sample size grows. For instance, in Experiment 1, where 16,384 synthetic scenarios
were generated, the covariance matrix condition number was κsynth = 47.33, compared to κsynth =
53.46 for 1,024 synthetic scenarios.
References
[1] Anderson, B. D. O.: Reverse-time diffusion equation models, Stochastic Processes and their Applications, 12, 313-326 (1982).
[2] Anderson, T. W.: On the distribution of the two-sample Cramer-von Mises criterion, The Annals of Mathematical Statistics, 33, 1148-1159 (1962).
[3] Johansson, K., Ogut, N. G., Pelger, M., Schmelzer, T., and Boyd, S.: A Simple Method for Predicting Covariance Matrices of Financial Returns, Foundations and Trends in Econometrics, 12, 324-407 (2023).
[4] Fan, J., Li, R., Zhang, C.-H., and Zou, H.: Statistical Foundations of Data Science, CRC Press (2020).
[5] Haussmann, U. G., and Pardoux, E.: Time Reversal of Diffusions, The Annals of Probability, 14, 1188-1205 (1986).
[6] Ho, J., Jain, A., and Abbeel, P.: Denoising diffusion probabilistic models, Advances in Neural Information Processing Systems, 33, 6840-6851 (2020).
[7] Hyvärinen, A.: Estimation of non-normalized statistical models by score matching, Journal of Machine Learning Research, 6, 695-709 (2005).
[8] Ikeda, N., and Watanabe, S.: Stochastic Differential Equations and Diffusion Processes, North-
Holland (1989).
[9] Jolicoeur-Martineau, A., Li, K., Piche-Taillefer, R., Kachman, T., and Mitliagkas, I.: Gotta Go
Fast When Generating Data with Score-Based Models, arXiv:2105.14080v1 (2021).
[10] Jordon, J., Houssiau, F., Cherubin, G., Cohen, S. N., Szpruch, L., Bottarelli, M., Maple, C., and Weller, A.: Synthetic Data - what, why and how?, arXiv:2205.03257 (2022).
[11] van Kampen, N. G.: Stochastic Processes in Physics and Chemistry, North Holland (1992).
[12] Kim, I., Balakrishnan, S., and Wasserman, L.: Robust Multivariate Nonparametric Tests via Projection-Averaging, Annals of Statistics, 48, 3417-3441 (2020).
[13] Kloeden, P. E., and Platen, E.: Numerical Solution of Stochastic Differential Equations, Springer
(1992).
[14] Kondratyev, A. and Schwarz, C.: The market generator, SSRN 3384948 (2020).
[15] Ledoit, O., and Wolf, M.: A well-conditioned estimator for large-dimensional covariance matrices, Journal of Multivariate Analysis, 88, 365-411 (2004).
[16] Lesniewski, A., and Trigila, G.: Projected Cramer-von Mises test, in preparation.
[17] Oksendal, B.: Stochastic Differential Equations: An Introduction with Applications, Springer
(2014).
[19] Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S.: Deep Unsupervised
Learning using Nonequilibrium Thermodynamics (2015).
[20] Song, Y., and Ermon, S.: Generative Modeling by Estimating Gradients of the Data Distribution,
NeurIPS Proceedings, 32 (2019).
[21] Song, Y., and Ermon, S.: Improved Techniques for Training Score-Based Generative Models,
NeurIPS Proceedings, 33 (2020).
[22] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B.: Score-Based Generative Modeling through Stochastic Differential Equations (2021).
[23] Vincent, P.: A connection between score matching and denoising autoencoders, Neural Computation, 23, 1661-1674 (2011).
[24] Wiese, M., Knobloch, R., Korn, R., and Kretschmer, P.: Quant GANs: Deep Generation of Financial Time Series, Quantitative Finance, 20, 1419-1440 (2020).
[25] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H.: Diffusion Models: A Comprehensive Survey of Methods and Applications, arXiv:2209.00796v11 (2023).