Journal of Computational Physics 449 (2022) 110768

When and why PINNs fail to train: A neural tangent kernel perspective

Sifan Wang a, Xinling Yu a, Paris Perdikaris b,*

a Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104, United States of America
b Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA 19104, United States of America

Article history: Available online 11 October 2021
Keywords: Physics-informed neural networks; Spectral bias; Multi-task learning; Gradient descent; Scientific machine learning

Abstract

Physics-informed neural networks (PINNs) have lately received great attention thanks to their flexibility in tackling a wide range of forward and inverse problems involving partial differential equations. However, despite their noticeable empirical success, little is known about how such constrained neural networks behave during their training via gradient descent. More importantly, even less is known about why such models sometimes fail to train at all. In this work, we aim to investigate these questions through the lens of the Neural Tangent Kernel (NTK); a kernel that captures the behavior of fully-connected neural networks in the infinite width limit during training via gradient descent. Specifically, we derive the NTK of PINNs and prove that, under appropriate conditions, it converges to a deterministic kernel that stays constant during training in the infinite-width limit. This allows us to analyze the training dynamics of PINNs through the lens of their limiting NTK and find a remarkable discrepancy in the convergence rate of the different loss components contributing to the total training error. To address this fundamental pathology, we propose a novel gradient descent algorithm that utilizes the eigenvalues of the NTK to adaptively calibrate the convergence rate of the total training error. Finally, we perform a series of numerical experiments to verify the correctness of our theory and the practical effectiveness of the proposed algorithms. The data and code accompanying this manuscript are publicly available at https://github.com/PredictiveIntelligenceLab/PINNsNTK.

© 2021 Elsevier Inc. All rights reserved.

1. Introduction

Thanks to the approximation capabilities of neural networks, physics-informed neural networks (PINNs) have already
led to a series of remarkable results across a range of problems in computational science and engineering, including fluid
mechanics [1–5], bio-engineering [6,7], meta-material design [8–10], free boundary problems [11], Bayesian networks and
uncertainty quantification [12–16], high-dimensional PDEs [17,18], stochastic differential equations [19], fractional differen-
tial equations [20,21], and beyond [22–25]. However, PINNs using fully connected architectures often fail to achieve stable
training and produce accurate predictions, especially when the underlying PDE solutions contain high-frequencies or multi-
scale features [26,13,27]. Recent work by Wang et al. [28] attributed this pathological behavior to multi-scale interactions

* Corresponding author.
E-mail addresses: sifanw@sas.upenn.edu (S. Wang), xlyu@sas.upenn.edu (X. Yu), pgp@seas.upenn.edu (P. Perdikaris).

https://doi.org/10.1016/j.jcp.2021.110768

between different terms in the PINNs loss function, ultimately leading to stiffness in the gradient flow dynamics, which,
consequently, introduces stringent stability requirements on the learning rate. To mitigate this pathology, Wang et al. [28]
proposed an empirical learning-rate annealing scheme that utilizes the back-propagated gradient statistics during training
to adaptively assign importance weights to different terms in a PINNs loss function, with the goal of balancing the mag-
nitudes of the back-propagated gradients. Although this approach was demonstrated to produce significant and consistent
improvements in the trainability and accuracy of PINNs, the fundamental reasons behind the practical difficulties of training
fully-connected PINNs still remain unclear [26].
Parallel to the development of PINNs, recent investigations have shed light into the representation shortcomings and
training deficiencies of fully-connected neural networks. Specifically, it has been shown that conventional fully-connected
architectures – such as the ones typically used in PINNs – suffer from “spectral bias” and are incapable of learning functions
with high frequencies, both in theory and in practice [29–32]. These observations are rigorously grounded by the newly
developed neural tangent kernel theory [33,34] that, by exploring the connection between deep neural networks and kernel
regression methods, elucidates the training dynamics of deep learning models. Specifically, the original work of Jacot et
al. [33] proved that, at initialization, fully-connected networks are equivalent to Gaussian processes in the infinite-width
limit [35–37], while the evolution of an infinite-width network during training can also be described by another kernel, the
so-called Neural Tangent Kernel (NTK) [33]. Remarkably, this function-space viewpoint allows us then to rigorously analyze
the training convergence of deep neural networks by examining the spectral properties of their limiting NTK [33,34,32].
Drawing motivation from the aforementioned developments, this work sets sail into investigating the training dynamics
of PINNs. To this end, we rigorously study fully-connected PINNs models through the lens of their neural tangent kernel, and
produce novel insights into when and why such models can be effectively trained, or not. Specifically, our main contributions
can be summarized into the following points:

• We prove that fully-connected PINNs converge to Gaussian processes in the infinite-width limit for linear PDEs.
• We derive the neural tangent kernel (NTK) of PINNs and prove that, under suitable assumptions, it converges to a
deterministic kernel and remains constant during training via gradient descent with an infinitesimally small learning
rate.
• We show how the convergence rate of the total training error of a PINNs model can be analyzed in terms of the
spectrum of its NTK at initialization.
• We show that fully-connected PINNs not only suffer from spectral bias, but also from a remarkable discrepancy of
convergence rate in the different components of their loss function.
• We propose a novel adaptive training strategy for resolving this pathological convergence behavior, and significantly
enhance the trainability and predictive accuracy of PINNs.

Taken together, these developments provide a novel path into analyzing the convergence of PINNs, and enable the design of
novel training algorithms that can significantly improve their trainability, accuracy and robustness.
This paper is organized as follows. In section 2, we present a brief overview of fully-connected neural networks and their
behavior in the infinite-width limit following the original formulation of Jacot et al. [33]. Next, we derive the NTK of PINNs
in a general setting and prove that, under suitable assumptions, it converges to a deterministic kernel and remains constant
during training via gradient descent with an infinitesimally small learning rate, see sections 3 and 4. Furthermore, in section 5
we analyze the training dynamics of PINNs, demonstrate that PINNs models suffer from spectral bias, and then propose a
novel algorithm to improve PINNs’ performance in practice. Finally we carry out a series of numerical experiments to verify
the developed NTK theory and validate the effectiveness of the proposed algorithm.

2. Infinitely wide neural networks

In this section, we revisit the definition of fully-connected neural networks and investigate their behavior under the
infinite-width limit. Let us start by formally defining the forward pass of a scalar valued fully-connected network with L
hidden layers, with the input and output dimensions denoted as d0 = d, and d L +1 = 1, respectively. For inputs x ∈ Rd we
also denote the input layer of the network as f (0) (x) = x for convenience. Then a fully-connected neural network with L
hidden layers is defined recursively as

g^{(h)}(x) = \frac{1}{\sqrt{d_h}} W^{(h)} \cdot f^{(h)}(x) + b^{(h)} \in \mathbb{R}^{d_{h+1}},    (2.1)
f^{(h+1)}(x) = \sigma\big(g^{(h)}(x)\big),    (2.2)

for h = 0, 1, \dots, L-1, where W^{(h)} \in \mathbb{R}^{d_{h+1} \times d_h} are weight matrices and b^{(h)} \in \mathbb{R}^{d_{h+1}} are bias vectors in the h-th hidden layer, and \sigma : \mathbb{R} \to \mathbb{R} is a coordinate-wise smooth activation function. The final output of the neural network is given by

f(x; \theta) = \frac{1}{\sqrt{d_L}} W^{(L)} \cdot f^{(L)}(x) + b^{(L)},    (2.3)


where W^{(L)} \in \mathbb{R}^{1 \times d_L} and b^{(L)} \in \mathbb{R} are the weight and bias parameters of the last layer. Here, \theta = \{W^{(0)}, b^{(0)}, \dots, W^{(L)}, b^{(L)}\} represents all parameters of the neural network. Such a parameterization is known as the "NTK parameterization" following the original work of Jacot et al. [33]. We remark that the scaling factors 1/\sqrt{d_h} are key to obtaining a consistent asymptotic behavior of neural networks as the widths of the hidden layers d_1, d_2, \dots, d_L grow to infinity.
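To make the NTK parameterization concrete, here is a minimal NumPy sketch of the forward pass in equations (2.1)-(2.3); the function names, the tanh activation, and the example widths are our own choices and not part of any reference implementation.

```python
import numpy as np

def init_params(layer_widths, rng=np.random.default_rng(0)):
    """All weights and biases are i.i.d. standard normal (NTK parameterization)."""
    params = []
    for d_in, d_out in zip(layer_widths[:-1], layer_widths[1:]):
        W = rng.standard_normal((d_out, d_in))
        b = rng.standard_normal(d_out)
        params.append((W, b))
    return params

def forward(x, params, sigma=np.tanh):
    """Forward pass of Eqs. (2.1)-(2.3): each pre-activation is scaled by 1/sqrt(d_h)."""
    f = x                                        # f^(0)(x) = x
    for W, b in params[:-1]:
        g = W @ f / np.sqrt(f.shape[0]) + b      # g^(h) = W^(h) f^(h) / sqrt(d_h) + b^(h)
        f = sigma(g)                             # f^(h+1) = sigma(g^(h))
    W, b = params[-1]
    return (W @ f / np.sqrt(f.shape[0]) + b)[0]  # scalar output f(x; theta)

# Example: a d = 2 input, two hidden layers of width 100, scalar output.
params = init_params([2, 100, 100, 1])
print(forward(np.array([0.3, -0.7]), params))
```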
We initialize all the weights and biases to be independent and identically distributed (i.i.d.) as standard normal distri-
bution N (0, 1) random variables, and consider the sequential limit of hidden widths d1 , d2 , . . . , d L → ∞. As described in
[33,36,35], all coordinates of f^{(h)} at each hidden layer asymptotically converge to an i.i.d. centered Gaussian process with covariance function \Sigma^{(h-1)} : \mathbb{R}^{d_{h-1}} \times \mathbb{R}^{d_{h-1}} \to \mathbb{R} defined recursively as

\Sigma^{(0)}(x, x') = x^T x' + 1,
\Lambda^{(h)}(x, x') = \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, x') \\ \Sigma^{(h-1)}(x', x) & \Sigma^{(h-1)}(x', x') \end{pmatrix} \in \mathbb{R}^{2 \times 2},    (2.4)
\Sigma^{(h)}(x, x') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\big[\sigma(u)\sigma(v)\big] + 1,

for h = 1, 2, \dots, L.
To introduce the neural tangent kernel (NTK), we also need to define

\dot{\Sigma}^{(h)}(x, x') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\big[\dot{\sigma}(u)\dot{\sigma}(v)\big],    (2.5)

where \dot{\sigma} denotes the derivative of the activation function \sigma.
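The recursions (2.4)-(2.5) can be approximated numerically by Monte Carlo sampling. The sketch below estimates \Sigma^{(1)} and \dot{\Sigma}^{(1)} for a one-hidden-layer tanh network under the stated Gaussian initialization; the helper names and the sample size are arbitrary choices of ours.

```python
import numpy as np

def Sigma0(x, xp):
    # Sigma^(0)(x, x') = x^T x' + 1
    return float(np.dot(x, xp)) + 1.0

def mc_Sigma1(x, xp, sigma=np.tanh, sigma_dot=lambda z: 1 - np.tanh(z) ** 2,
              n_samples=200_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimates of Sigma^(1)(x, x') and dot-Sigma^(1)(x, x')
    obtained by sampling (u, v) ~ N(0, Lambda^(1)), with Lambda^(1) built from Sigma^(0)."""
    Lam = np.array([[Sigma0(x, x),  Sigma0(x, xp)],
                    [Sigma0(xp, x), Sigma0(xp, xp)]])
    u, v = rng.multivariate_normal(np.zeros(2), Lam, size=n_samples).T
    Sigma1     = np.mean(sigma(u) * sigma(v)) + 1.0      # Eq. (2.4)
    Sigma1_dot = np.mean(sigma_dot(u) * sigma_dot(v))    # Eq. (2.5)
    return Sigma1, Sigma1_dot

x, xp = np.array([0.5]), np.array([-0.2])
print(mc_Sigma1(x, xp))
```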


Following the derivation of [33,38], the neural tangent kernel can be generally defined at any training time t, as the neural network parameters \theta(t) are changing during model training by gradient descent. This definition takes the form

\mathrm{Ker}_t(x, x') = \Big\langle \frac{\partial f(x; \theta(t))}{\partial \theta}, \frac{\partial f(x'; \theta(t))}{\partial \theta} \Big\rangle = \sum_{\theta \in \theta} \frac{\partial f(x; \theta(t))}{\partial \theta} \cdot \frac{\partial f(x'; \theta(t))}{\partial \theta},    (2.6)

where \langle \cdot, \cdot \rangle denotes an inner product over \theta. This kernel converges in probability to a deterministic kernel \Theta^{(L)}(x, x') at random initialization as the width of the hidden layers goes to infinity [33]. Specifically,

\lim_{d_L \to \infty} \cdots \lim_{d_1 \to \infty} \mathrm{Ker}_0(x, x') = \lim_{d_L \to \infty} \cdots \lim_{d_1 \to \infty} \Big\langle \frac{\partial f(x; \theta(0))}{\partial \theta}, \frac{\partial f(x'; \theta(0))}{\partial \theta} \Big\rangle = \Theta^{(L)}(x, x').    (2.7)

Here \Theta^{(L)}(x, x') denotes the limiting kernel function of an L-layer fully-connected neural network, which is defined by

\Theta^{(L)}(x, x') = \sum_{h=1}^{L+1} \Big( \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot{\Sigma}^{(h')}(x, x') \Big),    (2.8)

where \dot{\Sigma}^{(L+1)}(x, x') = 1 for convenience. Moreover, Jacot et al. [33] proved that, under some suitable conditions and for a fixed training time T, \mathrm{Ker}_t converges to \Theta^{(L)} for all 0 \leq t \leq T, as the width goes to infinity. As a consequence, a properly randomly initialized and sufficiently wide deep neural network trained by gradient descent is equivalent to kernel regression with a deterministic kernel.
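For a finite-width network, the kernel in equation (2.6) can be evaluated directly from parameter gradients. The sketch below does this for a one-hidden-layer network with a one-dimensional input, using the closed-form gradients that also appear later in Appendix C; as the width N grows, the value should concentrate around the limiting kernel. All names here are ours.

```python
import numpy as np

def grad_u(x, W0, b0, W1, b1, sigma=np.tanh, sigma_dot=lambda z: 1 - np.tanh(z) ** 2):
    """Gradient of a one-hidden-layer NTK-parameterized network u(x; theta)
    with respect to all parameters, stacked into one flat vector (1-D input)."""
    N = W0.shape[0]
    z = W0 * x + b0                          # pre-activations, shape (N,)
    dW0 = W1 * sigma_dot(z) * x / np.sqrt(N)
    dW1 = sigma(z) / np.sqrt(N)
    db0 = W1 * sigma_dot(z) / np.sqrt(N)
    db1 = np.ones(1)
    return np.concatenate([dW0, dW1, db0, db1])

def empirical_ntk(x, xp, W0, b0, W1, b1):
    # Ker_0(x, x') = < du/dtheta (x), du/dtheta (x') >, Eq. (2.6) at t = 0
    return grad_u(x, W0, b0, W1, b1) @ grad_u(xp, W0, b0, W1, b1)

rng = np.random.default_rng(0)
N = 10_000                                   # width; the kernel concentrates as N grows
W0, b0 = rng.standard_normal(N), rng.standard_normal(N)
W1, b1 = rng.standard_normal(N), rng.standard_normal(1)
print(empirical_ntk(0.3, 0.7, W0, b0, W1, b1))
```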

3. Physics-informed neural networks (PINNs)

In this section, we study physics-informed neural networks (PINNs) and their corresponding neural tangent kernels. To this end, we consider the following well-posed partial differential equation (PDE) defined on a bounded domain \Omega \subset \mathbb{R}^d:

N[u](x) = f(x), \quad x \in \Omega,    (3.1)
u(x) = g(x), \quad x \in \partial\Omega,    (3.2)

where N denotes a differential operator and u(x) : \Omega \to \mathbb{R} is the unknown solution with x = (x_1, x_2, \dots, x_d). Here we remark that, for time-dependent problems, we consider time t as an additional coordinate in x, and \Omega denotes the spatio-temporal domain. Then, the initial condition can simply be treated as a special type of Dirichlet boundary condition and included in equation (3.2).
Following the original work of Raissi et al. [39], we assume that the latent solution u (x) can be approximated by a deep
neural network u (x, θ) with parameters θ , where θ is a collection of all the parameters in the network. We can then define
the PDE residual r (x; θ ) as


r (x; θ ) := N [u ](x; θ) − f (x). (3.3)

Note that the parameters of u (x; θ ) can be “learned” by minimizing the following composite loss function

L(\theta) = L_b(\theta) + L_r(\theta),    (3.4)

where

L_b(\theta) = \frac{1}{2} \sum_{i=1}^{N_b} |u(x_b^i; \theta) - g(x_b^i)|^2,    (3.5)
L_r(\theta) = \frac{1}{2} \sum_{i=1}^{N_r} |r(x_r^i; \theta)|^2.    (3.6)

Here N_b and N_r denote the batch sizes for the training data \{x_b^i, g(x_b^i)\}_{i=1}^{N_b} and \{x_r^i, f(x_r^i)\}_{i=1}^{N_r}, respectively, which can be randomly sampled at each iteration of a gradient descent algorithm.
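As a small illustration, the composite loss (3.4)-(3.6) can be assembled as follows, assuming the network predictions and PDE residuals at the sampled points have already been computed; the array names are ours.

```python
import numpy as np

def pinn_loss(u_b_pred, g_b, r_pred):
    """Composite PINN loss L = L_b + L_r of Eqs. (3.4)-(3.6).
    u_b_pred : u(x_b^i; theta) at the N_b boundary points
    g_b      : boundary data g(x_b^i)
    r_pred   : PDE residuals r(x_r^i; theta) = N[u](x_r^i; theta) - f(x_r^i)
    """
    L_b = 0.5 * np.sum((u_b_pred - g_b) ** 2)   # Eq. (3.5)
    L_r = 0.5 * np.sum(r_pred ** 2)             # Eq. (3.6)
    return L_b + L_r                            # Eq. (3.4)
```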

3.1. Neural tangent kernel theory for PINNs

In this section we derive the neural tangent kernel of a physics-informed neural network. To this end, consider mini-
mizing the loss function (3.4) by gradient descent with an infinitesimally small learning rate, yielding the continuous-time
gradient flow system


\frac{d\theta}{dt} = -\nabla L(\theta),    (3.7)

and let u(t) = u(x_b; \theta(t)) = \{u(x_b^i; \theta(t))\}_{i=1}^{N_b} and N[u](t) = N[u](x_r; \theta(t)) = \{N[u](x_r^i; \theta(t))\}_{i=1}^{N_r}. Then the following lemma characterizes how u(t) and N[u](t) evolve during training by gradient descent.

Lemma 3.1. Given the data points \{x_b^i, g(x_b^i)\}_{i=1}^{N_b}, \{x_r^i, f(x_r^i)\}_{i=1}^{N_r} and the gradient flow (3.7), u(t) and N[u](t) obey the following evolution

\begin{pmatrix} \frac{du(x_b; \theta(t))}{dt} \\ \frac{dN[u](x_r; \theta(t))}{dt} \end{pmatrix} = - \begin{pmatrix} K_{uu}(t) & K_{ur}(t) \\ K_{ru}(t) & K_{rr}(t) \end{pmatrix} \cdot \begin{pmatrix} u(x_b; \theta(t)) - g(x_b) \\ N[u](x_r; \theta(t)) - f(x_r) \end{pmatrix},    (3.8)

where K_{ru}(t) = K_{ur}^T(t), and K_{uu}(t) \in \mathbb{R}^{N_b \times N_b}, K_{ur}(t) \in \mathbb{R}^{N_b \times N_r}, K_{rr}(t) \in \mathbb{R}^{N_r \times N_r} have (i, j)-th entries given by

(K_{uu})_{ij}(t) = \Big\langle \frac{du(x_b^i; \theta(t))}{d\theta}, \frac{du(x_b^j; \theta(t))}{d\theta} \Big\rangle,
(K_{ur})_{ij}(t) = \Big\langle \frac{du(x_b^i; \theta(t))}{d\theta}, \frac{dN[u](x_r^j; \theta(t))}{d\theta} \Big\rangle,    (3.9)
(K_{rr})_{ij}(t) = \Big\langle \frac{dN[u](x_r^i; \theta(t))}{d\theta}, \frac{dN[u](x_r^j; \theta(t))}{d\theta} \Big\rangle.

Proof. The proof of Lemma 3.1 is given in Appendix A. 

Remark 3.2. \langle \cdot, \cdot \rangle here denotes the inner product over all neural network parameters in \theta. For example,

(K_{uu})_{ij}(t) = \sum_{\theta \in \theta} \frac{du(x_b^i; \theta(t))}{d\theta} \cdot \frac{du(x_b^j; \theta(t))}{d\theta}.

 
Remark 3.3. We will denote the matrix \begin{pmatrix} K_{uu}(t) & K_{ur}(t) \\ K_{ru}(t) & K_{rr}(t) \end{pmatrix} by K(t) in the following sections. It is easy to see that K_{uu}(t), K_{rr}(t) and K(t) are all positive semi-definite matrices. Indeed, let J_u(t) and J_r(t) be the Jacobian matrices of u(t) and N[u](t) with respect to \theta, respectively. Then we can observe that

K_{uu}(t) = J_u(t) J_u^T(t), \quad K_{rr}(t) = J_r(t) J_r^T(t), \quad K(t) = \begin{pmatrix} J_u(t) \\ J_r(t) \end{pmatrix} \big(J_u^T(t), J_r^T(t)\big).
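Following Remark 3.3, the NTK of a PINN can be assembled directly from the two Jacobian matrices, as in the short sketch below (function and variable names are ours).

```python
import numpy as np

def ntk_blocks(J_u, J_r):
    """Assemble the NTK of a PINN from the Jacobians of Remark 3.3.
    J_u : (N_b, P) Jacobian of u(x_b; theta) with respect to the P parameters
    J_r : (N_r, P) Jacobian of N[u](x_r; theta) with respect to the P parameters
    """
    K_uu = J_u @ J_u.T                 # (N_b, N_b)
    K_ur = J_u @ J_r.T                 # (N_b, N_r)
    K_rr = J_r @ J_r.T                 # (N_r, N_r)
    K = np.block([[K_uu, K_ur],
                  [K_ur.T, K_rr]])     # K_ru = K_ur^T
    return K, K_uu, K_ur, K_rr

# Any K built this way is positive semi-definite, since K = J J^T with J = [J_u; J_r].
```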


Remark 3.4. It is worth pointing out that equation (3.8) holds for any differential operator N and any neural network architecture.

The statement of Lemma 3.1 involves the matrix K (t ), which we call the neural tangent kernel of a physics-informed neural
network (NTK of PINNs). Recall that an infinitely wide neural network is a Gaussian process, and its NTK remains constant
during training [33]. Now two natural questions arise: how does the PDE residual behave in the infinite width limit? Does
the NTK of PINNs exhibit similar behavior as the standard NTK? If so, what is the expression of the corresponding kernel?
In the next subsections, we will answer these questions and show that, in the infinite-width limit, the NTK of PINNs indeed
converges to a deterministic kernel at initialization and then remains constant during training.

4. Analyzing the training dynamics of PINNs through the lens of their NTK

To simplify the proof and understand the key ideas clearly, we confine ourselves to a simple model problem using a
fully-connected neural network with one hidden layer. To this end, we consider a one-dimensional Poisson equation as our
model problem. Let \Omega be a bounded open interval in \mathbb{R}. The partial differential equation is summarized as follows

u_{xx}(x) = f(x), \quad \forall x \in \Omega,
u(x) = g(x), \quad x \in \partial\Omega.    (4.1)
We proceed by approximating the solution u (x) by a fully-connected neural network denoted by u (x, θ ) with one hidden
layer. Now we define the network explicitly:

u(x; \theta) = \frac{1}{\sqrt{N}} W^{(1)} \cdot \sigma(W^{(0)} x + b^{(0)}) + b^{(1)},    (4.2)

where W^{(0)} \in \mathbb{R}^{N \times 1}, W^{(1)} \in \mathbb{R}^{1 \times N} are weights, b^{(0)} \in \mathbb{R}^N, b^{(1)} \in \mathbb{R} are biases, \theta = (W^{(0)}, W^{(1)}, b^{(0)}, b^{(1)}) represents all parameters in the network, and \sigma is a smooth activation function. Then it is straightforward to show that

u_{xx}(x; \theta) = \frac{1}{\sqrt{N}} W^{(1)} \cdot \big(\ddot{\sigma}(W^{(0)} x + b^{(0)}) \odot W^{(0)} \odot W^{(0)}\big),    (4.3)

where \odot denotes point-wise multiplication and \ddot{\sigma} denotes the second-order derivative of the activation function \sigma.
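For concreteness, a NumPy sketch of equations (4.2)-(4.3) for a tanh activation is given below; the second derivative of tanh is written out in closed form, and the variable names are ours.

```python
import numpy as np

def u_and_uxx(x, W0, b0, W1, b1):
    """u(x; theta) of Eq. (4.2) and u_xx(x; theta) of Eq. (4.3) for a
    one-hidden-layer tanh network with a one-dimensional input."""
    N = W0.shape[0]
    z = W0 * x + b0
    t = np.tanh(z)
    sigma_ddot = -2.0 * t * (1.0 - t ** 2)            # second derivative of tanh
    u   = (W1 @ t) / np.sqrt(N) + b1                  # Eq. (4.2)
    uxx = (W1 @ (sigma_ddot * W0 * W0)) / np.sqrt(N)  # Eq. (4.3)
    return u, uxx

rng = np.random.default_rng(0)
N = 100
W0, b0 = rng.standard_normal(N), rng.standard_normal(N)
W1, b1 = rng.standard_normal(N), rng.standard_normal()
print(u_and_uxx(0.5, W0, b0, W1, b1))
```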
We initialize all the weights and biases to be i.i.d. \mathcal{N}(0, 1) random variables. Based on our presentation in section 2, we already know that, in the infinite width limit, u(x; \theta) is a centered Gaussian process with covariance function \Sigma^{(1)}(x, x') at initialization, which is defined in equation (2.4). The following theorem reveals that u_{xx}(x; \theta) converges in distribution to another centered Gaussian process with covariance function \Sigma_{xx}^{(1)} under the same limit.

Theorem 4.1. Assume that the activation function \sigma is smooth and has a bounded second-order derivative \ddot{\sigma}. Then for a fully-connected neural network of one hidden layer at initialization,

u(x; \theta) \xrightarrow{D} \mathcal{GP}\big(0, \Sigma^{(1)}(x, x')\big),    (4.4)
u_{xx}(x; \theta) \xrightarrow{D} \mathcal{GP}\big(0, \Sigma_{xx}^{(1)}(x, x')\big),    (4.5)

as N \to \infty, where \xrightarrow{D} means convergence in distribution and

\Sigma_{xx}^{(1)}(x, x') = \mathbb{E}_{u, v \sim \mathcal{N}(0,1)}\big[u^4 \ddot{\sigma}(ux + v)\ddot{\sigma}(ux' + v)\big].    (4.6)

Proof. The proof can be found in Appendix B. 

Remark 4.2. By induction, the proof of Theorem 4.1 can be extended to differential operators of any order and fully-
connected neural networks with multiple hidden layers. Observe that a linear combination of Gaussian processes is still
a Gaussian process. Therefore, Theorem 4.1 can be generalized to any linear partial differential operator under appropriate
regularity conditions.

As an immediate corollary, a sufficiently wide physics-informed neural network for model problem (4.1) induces a joint
Gaussian process (GP) between the function values and the PDE residual at initialization, indicating a PINNs-GP correspon-
dence for linear PDEs.
The next question we investigate is whether the NTK of PINNs behaves similarly as the NTK of standard neural networks.
The next theorem proves that indeed the kernel K (0) converges in probability to a certain deterministic kernel matrix as
the width of the network goes to infinity.


Theorem 4.3. For a physics-informed network with one hidden layer at initialization, and in the limit as the layer's width N \to \infty, the NTK K(t) of the PINNs model defined in equation (3.9) converges in probability to a deterministic limiting kernel, i.e.,

K(0) = \begin{pmatrix} K_{uu}(0) & K_{ur}(0) \\ K_{ru}(0) & K_{rr}(0) \end{pmatrix} \to \begin{pmatrix} \Theta_{uu}^{(1)} & \Theta_{ur}^{(1)} \\ \Theta_{ru}^{(1)} & \Theta_{rr}^{(1)} \end{pmatrix} := K^*,    (4.7)

where the explicit expression of K^* is provided in Appendix C.

Proof. The proof can be found in Appendix C. 

Our second key result is that the NTK of PINNs stays asymptotically constant during training, i.e. K (t ) ≈ K (0) for all t. To
state and prove the theorem rigorously, we may assume that all parameters and the loss function do not blow up and are
uniformly bounded during training. The first two assumptions are both reasonable and practical, otherwise one would obtain
unstable and divergent training processes. In addition, the activation has to be 4-th order smooth and all its derivatives are
bounded. The last assumption is not a strong restriction since it is satisfied by most of the activation functions commonly
used for PINNs such as sigmoids, hyperbolic tangents, sine functions, etc.

Theorem 4.4. For the model problem (4.1) with a fully-connected neural network of one hidden layer, consider minimizing the loss function (3.4) by gradient descent with an infinitesimally small learning rate. For any T > 0 satisfying the following assumptions:

(i) there exists a constant C > 0 such that all parameters of the network are uniformly bounded for t \in [0, T], i.e.

\sup_{t \in [0, T]} \|\theta(t)\|_{\infty} \leq C,

where C does not depend on N;

(ii) there exists a constant C > 0 such that

\int_0^T \sum_{i=1}^{N_b} \big| u(x_b^i; \theta(\tau)) - g(x_b^i) \big| \, d\tau \leq C,
\int_0^T \sum_{i=1}^{N_r} \big| u_{xx}(x_r^i; \theta(\tau)) - f(x_r^i) \big| \, d\tau \leq C;

(iii) the activation function \sigma is smooth and |\sigma^{(k)}| \leq C for k = 0, 1, 2, 3, 4, where \sigma^{(k)} denotes the k-th order derivative of \sigma.

Then we have

\lim_{N \to \infty} \sup_{t \in [0, T]} \| K(t) - K(0) \|_2 = 0,    (4.8)

where K(t) is the corresponding NTK of the PINNs model.

Proof. The proof can be found in Appendix D. 

Here we provide some intuition behind the proof. The crucial observation is that all parameters of the network change very little during training (see Lemma D.2 in the Appendix). Intuitively, for sufficiently wide neural networks, even a slight movement of the weights contributes a non-negligible change to the network output, so the weights only need to move very little to fit the data. As a result, the gradients of the outputs u(x; \theta) and u_{xx}(x; \theta) with respect to the parameters barely change (see Lemma D.4), and, therefore, the kernel remains almost static during training.
Combining Theorem 4.3 and Theorem 4.4, we may conclude that, for the model problem of equation (4.1), we have

K(t) \approx K(0) \approx K^*, \quad \forall t,

from which (and equation (3.8)) we immediately obtain

\begin{pmatrix} \frac{du(x_b; \theta(t))}{dt} \\ \frac{du_{xx}(x_r; \theta(t))}{dt} \end{pmatrix} = -K(t) \cdot \begin{pmatrix} u(x_b; \theta(t)) - g(x_b) \\ u_{xx}(x_r; \theta(t)) - f(x_r) \end{pmatrix} \approx -K^* \cdot \begin{pmatrix} u(x_b; \theta(t)) - g(x_b) \\ u_{xx}(x_r; \theta(t)) - f(x_r) \end{pmatrix}.    (4.9)

Note that if the matrix K^* is invertible, then according to [40,31], the network's outputs u(x; \theta) and u_{xx}(x; \theta) after t steps of gradient descent can be approximated for arbitrary test data x_{test} as

\begin{pmatrix} u(x_{test}; \theta(t)) \\ u_{xx}(x_{test}; \theta(t)) \end{pmatrix} \approx K_{test}^* (K^*)^{-1} \big(I - e^{-K^* t}\big) \cdot \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix},    (4.10)

where K_{test}^* is the NTK matrix between all points in x_{test} and all training data. Letting t \to \infty, we obtain

\begin{pmatrix} u(x_{test}; \theta(\infty)) \\ u_{xx}(x_{test}; \theta(\infty)) \end{pmatrix} \approx K_{test}^* (K^*)^{-1} \cdot \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix}.

This implies that, under the assumption that K^* is invertible, an infinitely wide physics-informed neural network for model problem (4.1) is also equivalent to kernel regression. However, in the authors' experience, the NTK of PINNs is always degenerate (see Figs. 2c, 3c), which means that we cannot invert the kernel matrix and therefore cannot directly perform kernel regression predictions in practice.
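Purely for illustration, the kernel-regression prediction of equation (4.10) could be sketched as below. Since the NTK of PINNs is typically degenerate, we replace the inverse by an eigendecomposition restricted to the positive part of the spectrum (a pseudo-inverse); this workaround and all names here are our own, not part of the analysis above.

```python
import numpy as np

def ntk_predict(K_test, K_star, targets, t):
    """Kernel-regression prediction of Eq. (4.10) after 'time' t of gradient flow.
    K_test  : NTK between test points and all training points
    K_star  : limiting NTK on the training points (symmetric, positive semi-definite)
    targets : stacked data vector [g(x_b); f(x_r)]
    """
    lam, Q = np.linalg.eigh(K_star)                     # K* = Q diag(lam) Q^T
    pos = lam > 1e-10 * max(lam.max(), 1e-30)           # keep only the positive spectrum
    # (K*)^(-1) (I - exp(-K* t)) acting on the targets, zeroed on the null space
    factor = np.zeros_like(lam)
    factor[pos] = (1.0 - np.exp(-lam[pos] * t)) / lam[pos]
    return K_test @ (Q @ (factor * (Q.T @ targets)))
```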

5. Spectral bias in physics-informed neural networks

In this section, we will utilize the developed theory to investigate whether physics-informed neural networks are spec-
trally biased. The term “spectral bias” [29,41,32] refers to a well known pathology that prevents deep fully-connected
networks from learning high-frequency functions.
Since the NTK of PINNs barely changes during training, we may rewrite equation (4.9) as

\begin{pmatrix} \frac{du(x_b; \theta(t))}{dt} \\ \frac{du_{xx}(x_r; \theta(t))}{dt} \end{pmatrix} \approx -K(0) \cdot \begin{pmatrix} u(x_b; \theta(t)) - g(x_b) \\ u_{xx}(x_r; \theta(t)) - f(x_r) \end{pmatrix},    (5.1)

which leads to

\begin{pmatrix} u(x_b; \theta(t)) \\ u_{xx}(x_r; \theta(t)) \end{pmatrix} \approx \big(I - e^{-K(0) t}\big) \cdot \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix}.    (5.2)

As mentioned in Remark 3.3, the NTK of PINNs is positive semi-definite, so we can take its spectral decomposition K(0) = Q^T \Lambda Q, where Q is an orthogonal matrix and \Lambda is a diagonal matrix whose entries are the eigenvalues \lambda_i \geq 0 of K(0). Consequently, the training error is given by

\begin{pmatrix} u(x_b; \theta(t)) \\ u_{xx}(x_r; \theta(t)) \end{pmatrix} - \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix} \approx \big(I - e^{-K(0) t}\big) \cdot \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix} - \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix} \approx -Q^T e^{-\Lambda t} Q \cdot \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix},

which is equivalent to

Q \begin{pmatrix} u(x_b; \theta(t)) - g(x_b) \\ u_{xx}(x_r; \theta(t)) - f(x_r) \end{pmatrix} \approx -e^{-\Lambda t} Q \cdot \begin{pmatrix} g(x_b) \\ f(x_r) \end{pmatrix}.    (5.3)

This implies that the i-th component of the left hand side in equation (5.3) will decay approximately at the rate e −λi t . In
other words, the eigenvalues of the kernel characterize how fast the absolute training error decreases. Particularly, com-
ponents of the target function that correspond to kernel eigenvectors with larger eigenvalues will be learned faster. For
fully-connected networks, the eigenvectors corresponding to higher eigenvalues of the NTK matrix generally exhibit lower
frequencies [29,42,32]. From Fig. 1, one can observe that the eigenvalues of the NTK of PINNs decay rapidly. This results in
extremely slow convergence to the high-frequency components of the target function. Thus we may conclude that PINNs suffer from spectral bias as well.
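The decay predicted by equation (5.3) is easy to probe numerically: projecting the target onto the eigenvectors of K(0) and attaching the factors e^{-\lambda_i t} gives the component-wise error after time t. A minimal sketch follows (names are ours).

```python
import numpy as np

def training_error_decay(K0, targets, t):
    """Approximate absolute training error after time t, per Eq. (5.3):
    in the NTK eigenbasis, the i-th error component decays like exp(-lambda_i * t)."""
    lam, Q = np.linalg.eigh(K0)                 # K(0) = Q diag(lam) Q^T
    coeffs = Q.T @ targets                      # target projected onto the eigenvectors
    return np.abs(np.exp(-lam * t) * coeffs)    # decay is slow wherever lam_i is small

# Components aligned with small eigenvalues (typically the high frequencies) decay
# slowly, which is the spectral bias discussed above.
```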
More generally, the NTK of PINNs after t steps of gradient descent is given by

K(t) = \begin{pmatrix} K_{uu}(t) & K_{ur}(t) \\ K_{ru}(t) & K_{rr}(t) \end{pmatrix} = \begin{pmatrix} J_u(t) \\ J_r(t) \end{pmatrix} \big(J_u^T(t), J_r^T(t)\big) = J(t) J^T(t).

It follows that

\sum_{i=1}^{N_b + N_r} \lambda_i(t) = Tr(K(t)) = Tr\big(J(t) J^T(t)\big) = Tr\big(J^T(t) J(t)\big)
= Tr\big(J_u^T(t) J_u(t) + J_r^T(t) J_r(t)\big) = Tr\big(J_u(t) J_u^T(t)\big) + Tr\big(J_r(t) J_r^T(t)\big)
= \sum_{i=1}^{N_b} \lambda_i^{uu}(t) + \sum_{i=1}^{N_r} \lambda_i^{rr}(t),


Fig. 1. Model problem (1D Poisson equation): The eigenvalues of K , K uu and K rr at initialization in descending order for different fabricated solutions u (x) =
sin(aπ x) where a = 1, 2, 4.

where \lambda_i(t), \lambda_i^{uu}(t) and \lambda_i^{rr}(t) denote the eigenvalues of K(t), K_{uu}(t) and K_{rr}(t), respectively. This reveals that the overall convergence rate of the total training error is characterized by the eigenvalues of K_{uu} and K_{rr} together. Meanwhile, the separate training errors of u(x_b; \theta) and u_{xx}(x_r; \theta) are determined by the eigenvalues of K_{uu} and K_{rr}, respectively. The above observation motivates us to give the following definition.

Definition 5.1. For a positive semi-definite kernel matrix K \in \mathbb{R}^{n \times n}, the average convergence rate c is defined as the mean of all its eigenvalues \lambda_i, i.e.

c = \frac{\sum_{i=1}^{n} \lambda_i}{n} = \frac{Tr(K)}{n}.    (5.4)

In particular, for any two kernel matrices K_1, K_2 with average convergence rates c_1 and c_2, respectively, we say that K_1 dominates K_2 if c_1 \gg c_2.
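Definition 5.1 amounts to a single trace computation; a trivial sketch follows, where the threshold used to declare that one kernel "dominates" another is an arbitrary numerical proxy of ours for c_1 \gg c_2.

```python
import numpy as np

def average_convergence_rate(K):
    # c = (sum_i lambda_i) / n = Tr(K) / n, as in Definition 5.1
    return np.trace(K) / K.shape[0]

def dominates(K1, K2, ratio=10.0):
    # Loose numerical proxy for "K1 dominates K2", i.e. c1 >> c2 (ratio is our choice)
    return average_convergence_rate(K1) > ratio * average_convergence_rate(K2)
```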

As a concrete example, we train a fully-connected neural network with one hidden layer and 100 neurons to solve
the model Problem 7.1 with a fabricated solution u (x) = sin(aπ x) for different frequency amplitudes a. Fig. 1 shows all
eigenvalues of K , K uu and K rr at initialization in descending order. As with conventional deep fully-connected networks,
the eigenvalues of the PINNs’ NTK decay rapidly and most of the eigenvalues are near zero. Moreover, the distribution of
eigenvalues of K looks similar for different frequency functions (different a), which may heuristically explain why PINNs
tend to learn all frequencies almost simultaneously, as observed in Lu et al. [43].
Another key observation here is that the eigenvalues of K_{rr} are much greater than those of K_{uu}, namely K_{rr} dominates K_{uu} by Definition 5.1. As a consequence, the PDE residual loss converges much faster than the boundary condition loss, which
may prevent the network from approximating the correct solution. From the authors’ experience, high frequency functions
typically lead to high eigenvalues in K rr , but in some cases K uu can dominate K rr . We believe that such a discrepancy
between K uu and K rr is one of the key fundamental reasons behind why PINNs can often fail to train and yield accurate
predictions. In light of this evidence, in the next section, we describe a practical technique to address this pathology by
appropriately assigning weights to the different terms in a PINNs loss function.

6. Practical insights

In this section, we consider general PDEs of the form (3.1) - (3.2) by leveraging the NTK theory we developed for PINNs.
We approximate the latent solution u (x) by a fully-connected neural network u (x, θ) with multiple hidden layers, and train
its parameters θ by minimizing the following composite loss function

L(\theta) = L_b(\theta) + L_r(\theta)    (6.1)
= \frac{\lambda_b}{2 N_b} \sum_{i=1}^{N_b} |u(x_b^i; \theta) - g(x_b^i)|^2 + \frac{\lambda_r}{2 N_r} \sum_{i=1}^{N_r} |r(x_r^i; \theta)|^2,    (6.2)

where \lambda_b and \lambda_r are hyper-parameters which may be tuned manually or automatically by utilizing the back-propagated gradient statistics during training [28]. Here, the training data \{x_b^i, g(x_b^i)\}_{i=1}^{N_b} and \{x_r^i, f(x_r^i)\}_{i=1}^{N_r} may correspond to the full data-batch or to mini-batches that are randomly sampled at each iteration of gradient descent.
Similar to the proof of Lemma 3.1, we can derive the dynamics of the outputs u(x_b; \theta) and N[u](x_r; \theta) corresponding to the above loss function as

\begin{pmatrix} \frac{du(x_b; \theta(t))}{dt} \\ \frac{dN[u](x_r; \theta(t))}{dt} \end{pmatrix} = - \begin{pmatrix} \frac{\lambda_b}{N_b} K_{uu}(t) & \frac{\lambda_r}{N_r} K_{ur}(t) \\ \frac{\lambda_b}{N_b} K_{ru}(t) & \frac{\lambda_r}{N_r} K_{rr}(t) \end{pmatrix} \cdot \begin{pmatrix} u(x_b; \theta(t)) - g(x_b) \\ N[u](x_r; \theta(t)) - f(x_r) \end{pmatrix}    (6.3)
:= -\widetilde{K}(t) \cdot \begin{pmatrix} u(x_b; \theta(t)) - g(x_b) \\ N[u](x_r; \theta(t)) - f(x_r) \end{pmatrix},    (6.4)

where K uu , K ur and K rr are defined to be the same as in equation (3.9). From simple stability analysis of a gradient
descent (i.e. forward Euler [44]) discretization of the above ODE system, the maximum learning rate should be less than or equal to 2/\lambda_{max}(\widetilde{K}(t)). Also note that an alternative mechanism for controlling stability is to increase the batch size, which
effectively corresponds to decreasing the learning rate. Recall that the current setup in the main theorems put forth in
this work holds for the model problem in equation (4.1) and fully-connected networks of one hidden layer with an NTK
parameterization. This implies that, for general nonlinear PDEs, the NTK of PINNs may not remain fixed during training.
Nevertheless, as mentioned in Remark 3.4, we emphasize that, given an infinitesimal learning rate, equation (6.3) holds for
any network architecture and any differential operator. Similarly, by a singular value decomposition, the singular values of the NTK \widetilde{K}(t) determine the convergence rate of the training error, since \widetilde{K}(t) may not necessarily be positive semi-definite.
Therefore, we can still understand the training dynamics of PINNs by tracking their NTK  K (t ) during training, even for
general nonlinear PDE problems.
A key observation here is that the magnitudes of \lambda_b, \lambda_r, as well as the mini-batch size, have a crucial impact on the singular values of \widetilde{K}(t) and, thus, on the convergence rate of the training errors of u(x_b; \theta) and N[u](x_r; \theta). For instance, if we increase \lambda_b and fix the batch sizes N_b, N_r and the weight \lambda_r, then this will improve the convergence rate of u(x_b; \theta). Furthermore, in the sense of convergence rate, changing the weights \lambda_b or \lambda_r is equivalent to changing the corresponding batch sizes N_b, N_r. Based on these observations, we can overcome the discrepancy between K_{uu} and K_{rr} discussed in section 5 by calibrating the weights or batch sizes such that each component of u(x_b; \theta) and N[u](x_r; \theta) has a convergence rate of similar magnitude. Since manipulating the batch size may involve extra computational costs (e.g., it may result in prohibitively large batches), here we fix the batch size and only consider adjusting the weights \lambda_b or \lambda_r according to Algorithm 1.

Algorithm 1: Adaptive weights for physics-informed neural networks.

Consider a physics-informed neural network u(x; \theta) with parameters \theta and a loss function

L(\theta) := \lambda_b L_b(\theta) + \lambda_r L_r(\theta),

where L_r(\theta) denotes the PDE residual loss and L_b(\theta) corresponds to the boundary conditions. Initialize the weights \lambda_b, \lambda_r to 1 and use S steps of a gradient descent algorithm to update the parameters \theta as:

for n = 1, ..., S do
  (a) Compute \lambda_b and \lambda_r by

    \lambda_b = \frac{\sum_{i=1}^{N_b + N_r} \lambda_i(n)}{\sum_{i=1}^{N_b} \lambda_i^{uu}(n)} = \frac{Tr(K(n))}{Tr(K_{uu}(n))},    (6.5)
    \lambda_r = \frac{\sum_{i=1}^{N_b + N_r} \lambda_i(n)}{\sum_{i=1}^{N_r} \lambda_i^{rr}(n)} = \frac{Tr(K(n))}{Tr(K_{rr}(n))},    (6.6)

  where \lambda_i(n), \lambda_i^{uu}(n) and \lambda_i^{rr}(n) are the eigenvalues of K(n), K_{uu}(n), K_{rr}(n) at the n-th iteration.
  (b) Update the parameters \theta via gradient descent

    \theta_{n+1} = \theta_n - \eta \nabla_\theta L(\theta_n)    (6.7)

end

First we remark that the updates in equations (6.5) and (6.6) can either take place at every iteration of the gradient
descent loop, or at a frequency specified by the user (e.g., every 10 gradient descent steps). To compute the sum of eigen-
values, it suffices to compute the trace of the corresponding NTK matrices, which can save some computational resources.
Besides, we point out that the computation of the NTK K(t) is associated with the training data points fed to the network at each iteration, which means that the values of the kernel are not necessarily the same at each iteration. However, if we assume that all training data points are sampled from the same distribution and that the change of the NTK at each iteration is negligible, then the computed kernels should be approximately equal up to a permutation matrix. As a result, the change of the eigenvalues
of K (t ) at each iteration is also negligible and thus the training process of Algorithm 1 should be stable. In section 7.2, we
performed detailed numerical experiments to validate the effectiveness of the proposed algorithm.
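A minimal sketch of the weight update in Algorithm 1 is given below; only the traces of K_uu and K_rr are needed, as noted above. The training-loop portion is shown as comments because compute_ntk_blocks and grad_loss are hypothetical placeholders of ours, not functions defined in this paper or its code release.

```python
import numpy as np

def adaptive_weights(K_uu, K_rr):
    """Weight update of Algorithm 1, Eqs. (6.5)-(6.6):
    lambda_b = Tr(K) / Tr(K_uu), lambda_r = Tr(K) / Tr(K_rr),
    using Tr(K) = Tr(K_uu) + Tr(K_rr)."""
    tr_uu, tr_rr = np.trace(K_uu), np.trace(K_rr)
    tr_K = tr_uu + tr_rr
    return tr_K / tr_uu, tr_K / tr_rr

# Sketch of the surrounding training loop (placeholders, not real APIs):
# lam_b, lam_r = 1.0, 1.0
# for n in range(S):
#     if n % update_every == 0:                               # e.g. every 1,000 iterations
#         K_uu, K_rr = compute_ntk_blocks(theta, x_b, x_r)    # hypothetical helper
#         lam_b, lam_r = adaptive_weights(K_uu, K_rr)
#     theta = theta - eta * grad_loss(theta, lam_b, lam_r)    # Eq. (6.7), hypothetical helper
```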
Here we also note that, in previous work, Wang et al. [28] introduced an alternative empirical approach for automatically tuning the weights \lambda_b or \lambda_r with the goal of balancing the magnitudes of the back-propagated gradients originating from
different terms in a PINNs loss function. While effective in practice, this approach lacked any theoretical justification and did
not provide a deeper insight into the training dynamics of PINNs. In contrast, the approach presented here follows naturally
from the NTK theory derived in section 4, and aims to trace and tackle the pathological convergence behavior of PINNs at
its root.


Fig. 2. Model Problem 7.1 (1D Poisson equation): (a), (b) The relative change of the parameters \theta and of the NTK of PINNs K obtained by training a fully-connected neural network with one hidden layer and different widths (10, 100, 500) via 10,000 iterations of full-batch gradient descent with a learning rate of 10^{-5}. (c) The eigenvalues of the NTK K at initialization and at the last step (n = 10^4) of training a width = 500 fully-connected neural network.

7. Numerical experiments

In this section, we provide a series of numerical studies that aim to validate our theory or assess the performance of the proposed algorithm against standard PINNs [27] for inferring the solution of PDEs. Throughout the numerical experiments we approximate the latent variables by fully-connected neural networks with the NTK parameterization (2.3) and hyperbolic tangent activation functions. All networks are trained using standard stochastic gradient descent, unless otherwise specified. Finally, all results presented in this section can be reproduced using our publicly available code at https://
github.com/PredictiveIntelligenceLab/PINNsNTK.

7.1. Convergence of the NTK of PINNs

As our first numerical example, we still focus on the model problem (4.1) and verify the convergence of the PINNs’
NTK. Specifically, we set  to be the unit interval [0, 1] and fabricate the exact solution to this problem taking the form
u (x) = sin(π x). The corresponding f and g are given by

f(x) = -\pi^2 \sin(\pi x), \quad x \in [0, 1],
g(x) = 0, \quad x = 0, 1.

We proceed by approximating the latent solution u (x) by a fully-connected neural network u (x; θ ) of one hidden layer with
NTK parameterization (see equation (2.3)), and a hyperbolic tangent activation function. The corresponding loss function is
given by

L(\theta) = L_b(\theta) + L_r(\theta)    (7.1)
= \frac{1}{2} \sum_{i=1}^{N_b} |u(x_b^i; \theta) - g(x_b^i)|^2 + \frac{1}{2} \sum_{i=1}^{N_r} |u_{xx}(x_r^i; \theta) - f(x_r^i)|^2.    (7.2)

Here we choose N_b = N_r = 100 and the collocation points \{x_r^i\}_{i=1}^{N_r} are uniformly spaced in the unit interval. To monitor
the change of the NTK K (t ) for this PINN model, we train the network for different widths and for 10, 000 iterations by
minimizing the loss function given above using standard full-batch gradient descent with a learning rate of 10−5 . Here we
remark that, in order to keep the gradient descent dynamics (3.8) steady, the learning rate should be less than 2/λmax ,
where λmax denotes the maximum eigenvalue of K (t ).
Fig. 2a and 2b present the relative change in the norm of network’s weights and NTK (starting from a random initial-
ization) during training. As it can be seen, the change of both the weights and the NTK tends to zero as the width of the
network grows to infinity, which is consistent with Lemma D.2 and Theorem 4.4. Moreover, we know that convergence in a
matrix norm implies convergence in eigenvalues, and eigenvalues characterize the properties of a given matrix. To this end,
we compute and monitor all eigenvalues of K(t) for the network of width = 500 at initialization and after 10,000 steps of gradient descent, and plot them in descending order in Fig. 2c. As expected, we see that all eigenvalues barely change between these two
snapshots. Based on these observations, we may conclude that the NTK of PINNs with one hidden layer stays almost fixed
during training.
However, PINNs of multiple hidden layers are not covered by our theory at the moment. Out of interest, we also inves-
tigate the relative change of weights, kernel, as well as the kernel’s eigenvalues for a fully-connected network with three
hidden layers (see Fig. 3). We can observe that the change in both the weights and the NTK behaves almost identically to the


Fig. 3. Model Problem 7.1 (1D Poisson equation): (a), (b) The relative change of the parameters \theta and of the NTK of PINNs K obtained by training a fully-connected neural network with three hidden layers and different widths (10, 100, 500) via 10,000 iterations of full-batch gradient descent with a learning rate of 10^{-5}. (c) The eigenvalues of the NTK K at initialization and at the last step (n = 10^4) of training a width = 500 fully-connected neural network.

case of a fully-connected network with one hidden layer shown in Fig. 2. Therefore we may conjecture that, for any linear
or even nonlinear PDEs, the NTK of PINNs converges to a deterministic kernel and remains constant during training in the
infinite width limit.

7.2. Adaptive training for PINNs

In this section, we aim to validate the developed theory and examine the effectiveness of the proposed adaptive training
algorithm on the model problem of equation 7.1. To this end, we consider a fabricated exact solution of the form u (x) =
sin(4π x), inducing a corresponding forcing term f and Dirichlet boundary condition g given by

f(x) = -16\pi^2 \sin(4\pi x), \quad x \in [0, 1],
g(x) = 0, \quad x = 0, 1.

We proceed by approximating the latent solution u (x) by a fully-connected neural network with one hidden layer and
width set to 100. Recall from Theorem 4.3 and Theorem 4.4, that the NTK barely changes during training. This implies
that the weights \lambda_b, \lambda_r are determined by the NTK at initialization and thus can be regarded as fixed weights during training. Moreover, from Fig. 1, we already know that K_{rr} dominates K_{uu} for this example. Therefore, the updating rule for the hyper-parameters \lambda_b, \lambda_r at the t-th step of gradient descent can be reduced to
\lambda_b = \frac{\sum_{i=1}^{N_b + N_r} \lambda_i(t)}{\sum_{i=1}^{N_b} \lambda_i^{uu}(t)} \approx \frac{\sum_{i=1}^{N_r} \lambda_i^{rr}(t)}{\sum_{i=1}^{N_b} \lambda_i^{uu}(t)} \approx \frac{Tr(K_{rr}(0))}{Tr(K_{uu}(0))},    (7.3)
\lambda_r = \frac{\sum_{i=1}^{N_b + N_r} \lambda_i(t)}{\sum_{i=1}^{N_r} \lambda_i^{rr}(t)} \approx 1.    (7.4)

We proceed by training the network via full-batch gradient descent with a learning rate of 10−5 to minimize the following
loss function

L(\theta) = L_b(\theta) + L_r(\theta)
= \frac{\lambda_b}{2 N_b} \sum_{i=1}^{N_b} |u(x_b^i; \theta) - g(x_b^i)|^2 + \frac{\lambda_r}{2 N_r} \sum_{i=1}^{N_r} |u_{xx}(x_r^i; \theta) - f(x_r^i)|^2,

where the batch sizes are N b = N r = 100, λr = 1 and the computed λb ≈ 100.
A comparison of the predicted solution u(x) between the original PINNs (\lambda_b = \lambda_r = 1) and PINNs with adaptive weights after 40,000 iterations is shown in Fig. 4. It can be observed that the proposed algorithm yields a much more accurate
predicted solution and improves the relative L 2 error by about two orders of magnitude. Furthermore, we also investigate
how the predictive performance of PINNs depends on the choice of different weights in the loss function. To this end, we fix
λr = 1 and train the same network, but now we manually tune λb . Fig. 5 presents a visual assessment of relative L 2 errors
of predicted solutions for different λb ∈ [1, 500] averaged over ten independent trials. One can see that the relative L 2 error
decreases rapidly to a local minimum as λb increases from 1 to about 100 and then shows oscillations as λb continues to


Fig. 4. Model Problem 7.2 (1D Poisson equation): (a) The predicted solution against the exact solution obtained by training a fully-connected neural network of
one hidden layer with width = 100 via 40, 000 iterations of full-batch gradient descent with a learning rate of 10−5 . The relative L 2 error is 2.40e − 01. (b)
The predicted solution against the exact solution obtained by training the same neural network using fixed weights λb = 100, λr = 1 via 40, 000 iterations
of full-batch gradient descent with a learning rate of 10−5 . The relative L 2 error is 1.63e − 03.

Fig. 5. Model problem of equation 7.2 (1D Poisson equation): The relative L 2 error of predicted solutions averaged over 10 independent trials by training a
fully-connected neural network of one hidden layer with width = 100 using different fixed weights λb ∈ [1, 500] for 40, 000 gradient descent iterations.

increase. Moreover, a large magnitude of \lambda_b seems to lead to a large standard deviation in the L^2 error, which may be due to imaginary eigenvalues of the indefinite kernel \widetilde{K} resulting in an unstable training process. This empirical simulation
study confirms that the weights λr = 1 and λb suggested by our theoretical analysis based on analyzing the NTK spectrum
are robust and closely agree with the optimal weights obtained via manual hyper-parameter tuning.


7.3. One-dimensional wave equation

As our last example, we present a study that demonstrates the effectiveness of Algorithm 1 in a practical problem for
which conventional PINN models face severe difficulties. To this end, we consider a one-dimensional wave equation in the domain \Omega = [0, 1] \times [0, 1] taking the form

u_{tt}(x, t) - 4 u_{xx}(x, t) = 0, \quad (x, t) \in (0, 1) \times (0, 1),    (7.5)
u(0, t) = u(1, t) = 0, \quad t \in [0, 1],    (7.6)
u(x, 0) = \sin(\pi x) + \frac{1}{2} \sin(4\pi x), \quad x \in [0, 1],    (7.7)
u_t(x, 0) = 0, \quad x \in [0, 1].    (7.8)

First, by d'Alembert's formula [45], the solution u(x, t) is given by

u(x, t) = \sin(\pi x) \cos(2\pi t) + \frac{1}{2} \sin(4\pi x) \cos(8\pi t).    (7.9)

Here we treat the temporal coordinate t as an additional spatial coordinate in x, and then the initial condition (7.7) can be included in the boundary condition (7.6), namely

u(x) = g(x), \quad x \in \partial\Omega.

Now we approximate the solution u by a 5-layer deep fully-connected network u (x, θ) with 500 neurons per hidden layer,
where x = (x, t ). Then we can formulate a “physics-informed” loss function by

L(\theta) = \lambda_u L_u(\theta) + \lambda_{u_t} L_{u_t}(\theta) + \lambda_r L_r(\theta)    (7.10)
= \frac{\lambda_u}{2 N_u} \sum_{i=1}^{N_u} |u(x_u^i; \theta) - g(x_u^i)|^2 + \frac{\lambda_{u_t}}{2 N_{u_t}} \sum_{i=1}^{N_{u_t}} |u_t(x_{u_t}^i; \theta)|^2 + \frac{\lambda_r}{2 N_r} \sum_{i=1}^{N_r} |N[u](x_r^i; \theta)|^2,    (7.11)

where the hyper-parameters \lambda_u, \lambda_{u_t}, \lambda_r are initialized to 1, the batch sizes are set to N_u = N_{u_t} = N_r = 300, and N = \partial_{tt} - 4\partial_{xx}. Here all training data are uniformly sampled inside the computational domain at each gradient descent iteration.
The network u (x; θ ) is initialized using the standard Glorot scheme [46] and then trained by minimizing the above loss
function via stochastic gradient descent using the Adam optimizer with default settings [47]. Fig. 6a provides a comparison
between the predicted solution against the ground truth obtained after 80, 000 training iterations. Clearly the original PINN
model fails to approximate the ground truth solution and the relative L 2 error is above 40%.
To explore the reason behind the failure of PINNs for this example, we compute their NTK and track it during training. Similar to the proof of Lemma 3.1, the corresponding NTK can be derived from the loss function (7.10) as

\begin{pmatrix} \frac{du(x_u; \theta(t))}{dt} \\ \frac{du_t(x_{u_t}; \theta(t))}{dt} \\ \frac{dN[u](x_r; \theta(t))}{dt} \end{pmatrix} = -\widetilde{K}(t) \cdot \begin{pmatrix} u(x_u; \theta(t)) - g(x_u) \\ u_t(x_{u_t}; \theta(t)) \\ N[u](x_r; \theta(t)) \end{pmatrix},    (7.12)

where

\widetilde{K}(t) = \begin{pmatrix} \frac{\lambda_u}{N_u} J_u(t) \\ \frac{\lambda_{u_t}}{N_{u_t}} J_{u_t}(t) \\ \frac{\lambda_r}{N_r} J_r(t) \end{pmatrix} \cdot \big(J_u^T(t), J_{u_t}^T(t), J_r^T(t)\big),

[K_u(t)]_{ij} = \big[J_u(t) J_u^T(t)\big]_{ij} = \Big\langle \frac{du(x_u^i; \theta(t))}{d\theta}, \frac{du(x_u^j; \theta(t))}{d\theta} \Big\rangle,
[K_{u_t}(t)]_{ij} = \big[J_{u_t}(t) J_{u_t}^T(t)\big]_{ij} = \Big\langle \frac{du_t(x_{u_t}^i; \theta(t))}{d\theta}, \frac{du_t(x_{u_t}^j; \theta(t))}{d\theta} \Big\rangle,
[K_r(t)]_{ij} = \big[J_r(t) J_r^T(t)\big]_{ij} = \Big\langle \frac{dN[u](x_r^i; \theta(t))}{d\theta}, \frac{dN[u](x_r^j; \theta(t))}{d\theta} \Big\rangle.

A visual assessment of the eigenvalues of K_u, K_{u_t} and K_r at initialization and at the last step of gradient descent is presented
in Fig. 7. It can be observed that the NTK does not remain fixed and all eigenvalues move “outward” in the beginning of the
training, and then remain almost static such that K r and K ut dominate K u during training. Consequently, the components


Fig. 6. One-dimensional wave equation: (a) The predicted solution versus the exact solution by training a fully-connected neural network with five hidden
layers and 500 neurons per layer using the Adam optimizer with default settings [47] after 80, 000 iterations. The relative L 2 error is 4.518e − 01. (b) The
predicted solution versus the exact solution by training the same network using Algorithm 1 after 80, 000 iterations. The relative L 2 error is 1.728e − 03.
(For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)

Fig. 7. One-dimensional wave equation: Eigenvalues of K u , K ut and K r at different snapshots during training, sorted in descending order.

of ut (xut ; θ ) and N [u ](xr ; θ ) converge much faster than the loss of boundary conditions, and, therefore, introduce a severe
discrepancy in the convergence rate of each different term in the loss, causing this standard PINNs model to collapse. To
verify our hypothesis, we also train the same network using Algorithm 1 with the following generalized updating rule for
hyper-parameters λu , λut and λr

\lambda_u = \frac{Tr(K_u) + Tr(K_{u_t}) + Tr(K_r)}{Tr(K_u)},    (7.13)
\lambda_{u_t} = \frac{Tr(K_u) + Tr(K_{u_t}) + Tr(K_r)}{Tr(K_{u_t})},    (7.14)
\lambda_r = \frac{Tr(K_u) + Tr(K_{u_t}) + Tr(K_r)}{Tr(K_r)}.    (7.15)
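The generalized rule (7.13)-(7.15) is a direct extension of Algorithm 1 to three loss terms; a minimal sketch, with variable names of our own choosing, is given below.

```python
import numpy as np

def adaptive_weights_wave(K_u, K_ut, K_r):
    """Generalized weight update of Eqs. (7.13)-(7.15) for the wave equation:
    each weight equals the total trace Tr(K_u) + Tr(K_ut) + Tr(K_r)
    divided by the trace of that term's own kernel block."""
    traces = np.array([np.trace(K_u), np.trace(K_ut), np.trace(K_r)])
    total = traces.sum()
    lam_u, lam_ut, lam_r = total / traces
    return lam_u, lam_ut, lam_r
```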


Fig. 8. One-dimensional wave equation: (a) The evolution of hyper-parameters λu , λut and λr during training of a five-layer deep fully-connected neural
network with 500 neurons per layer using Algorithm 1. (b) The eigenvalues of K u , K ut , K r and λu K u , λut K ut , λr K r at last step of training.

In particular, we update these weights every 1,000 training iterations, hence the extra computational cost compared to a standard PINNs approach is negligible. The results of this experiment are shown in Fig. 6b, from which one can easily see that the predicted solution obtained using the proposed adaptive training scheme achieves excellent agreement with the ground truth and the relative L^2 error is 1.73e-3. To quantify the effect of the hyper-parameters \lambda_u, \lambda_{u_t} and \lambda_r on the NTK, we also compare the eigenvalues of K_u, K_{u_t} and K_r with and without multiplication by the hyper-parameters at the last step of gradient descent. As it can be seen in Fig. 8b, the discrepancy in the convergence rate of the different components of the total training error is considerably resolved. Furthermore, Fig. 8a presents the change of the weights during training, and we can see
that λu , λut increase rapidly and then remain almost fixed while λr is near 1 for all time. So we may conclude that the
overall training process using Algorithm 1 is stable.

8. Discussion

This work has produced a novel theoretical understanding of physics-informed neural networks by deriving and analyzing
their limiting neural tangent kernel. Specifically, we first show that infinitely wide physics-informed neural networks under
the NTK parameterization converge to Gaussian processes. Furthermore, we derive the NTK of PINNs and show that, under suitable assumptions, it converges to a deterministic kernel and barely changes during training as the width of the network grows to infinity. To provide further insight, we analyze the training dynamics of fully-connected PINNs through the lens of their NTK and show that they not only suffer from spectral bias, but also exhibit a discrepancy in the convergence rate among the different loss components contributing to the total training error. To resolve this discrepancy, we propose a novel algorithm in which the coefficients of the different terms in a PINNs loss function are dynamically updated so as to balance the average convergence rate of the different components of the total training error. Finally, we carry out a series of
numerical experiments to verify our theory and validate the effectiveness of the proposed algorithms.
Although this work takes an essential step towards understanding PINNs and their training dynamics, there are many
open questions worth exploring. Can the proposed NTK theory for PINNs be extended to fully-connected networks with multiple hidden layers, to nonlinear equations, as well as to other neural network architectures such as convolutional neural networks, residual networks, etc.? To what extent do these architectures suffer from spectral bias or exhibit similar discrepancies in
their convergence rate? In a parallel thrust, it is well-known that PINNs perform much better for inverse problems than
for forward problems, such as the ones considered in this work. Can we use the current theory to analyze inverse problems and explain why they are better suited to PINNs? Moreover, going beyond vanilla gradient descent dynamics, how
do the training dynamics of PINNs and their corresponding NTK evolve via gradient descent with momentum (e.g. Adam
[47])? In practice, despite some improvements in the performance of PINNs brought by assigning appropriate weights to the
loss function, we emphasize that such methods cannot change the distribution of eigenvalues of the NTK and, thus, cannot
directly resolve spectral bias. Apart from this, assigning weights may result in indefinite kernels which can have imaginary
eigenvalues and thus yield unstable training processes. Therefore, can we come up with better methodologies to resolve
spectral bias using specialized network architectures, loss functions, etc.? In a broad sense, PINNs can be regarded as a spe-
cial multi-task learning problem in which a neural network is asked to simultaneously fit the observed data and minimize
a PDE residual. It is then natural to ask: does the proposed NTK theory hold for general multi-task learning problems? Can
other heuristic methods for multi-task learning [48–50] be analyzed and improved under this setting? We believe that an-
swering these questions not only paves a new way to better understand PINNs and their training dynamics, but also opens a
new door for developing scientific machine learning algorithms with provable convergence guarantees, as needed for many
critical applications in computational science and engineering.


CRediT authorship contribution statement

Sifan Wang: Conceptualization, Methodology, Implementation, Numerical experiments, Visualization, Draft Preparation.
Xinling Yu: Numerical experiments.
Paris Perdikaris: Conceptualization, Methodology, Supervision, Funding, Writing-Reviewing and Editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgements

This work received support from DOE grant DE-SC0019116, AFOSR grant FA9550-20-1-0060, and DOE-ARPA grant DE-
AR0001201.

Appendix A. Proof of Lemma 3.1

Proof. Recall that for given training data \{x_b^i, g(x_b^i)\}_{i=1}^{N_b}, \{x_r^i, f(x_r^i)\}_{i=1}^{N_r}, the loss function is given by

L(\theta) = L_b(\theta) + L_r(\theta) = \frac{1}{2} \sum_{i=1}^{N_b} |u(x_b^i; \theta) - g(x_b^i)|^2 + \frac{1}{2} \sum_{i=1}^{N_r} |r(x_r^i; \theta)|^2.

Now let us consider the corresponding gradient flow

\frac{d\theta}{dt} = -\nabla_\theta L(\theta) = -\Big[ \sum_{i=1}^{N_b} \big(u(x_b^i; \theta(t)) - g(x_b^i)\big) \frac{\partial u}{\partial \theta}(x_b^i; \theta(t)) + \sum_{i=1}^{N_r} \big(N[u](x_r^i; \theta(t)) - f(x_r^i)\big) \frac{\partial N[u]}{\partial \theta}(x_r^i; \theta(t)) \Big].

It follows that for 1 \leq j \leq N_b,

\frac{du(x_b^j; \theta(t))}{dt} = \Big(\frac{du(x_b^j; \theta(t))}{d\theta}\Big)^T \cdot \frac{d\theta}{dt}
= -\Big(\frac{du(x_b^j; \theta(t))}{d\theta}\Big)^T \cdot \Big[ \sum_{i=1}^{N_b} \big(u(x_b^i; \theta(t)) - g(x_b^i)\big) \frac{\partial u}{\partial \theta}(x_b^i; \theta(t)) + \sum_{i=1}^{N_r} \big(N[u](x_r^i; \theta(t)) - f(x_r^i)\big) \frac{\partial N[u]}{\partial \theta}(x_r^i; \theta(t)) \Big]
= -\sum_{i=1}^{N_b} \big(u(x_b^i; \theta) - g(x_b^i)\big) \Big\langle \frac{du(x_b^i; \theta(t))}{d\theta}, \frac{du(x_b^j; \theta(t))}{d\theta} \Big\rangle - \sum_{i=1}^{N_r} \big(N[u](x_r^i; \theta) - f(x_r^i)\big) \Big\langle \frac{dN[u](x_r^i; \theta(t))}{d\theta}, \frac{du(x_b^j; \theta(t))}{d\theta} \Big\rangle.

Similarly, for 1 \leq j \leq N_r,

\frac{dN[u](x_r^j; \theta(t))}{dt} = \Big(\frac{dN[u](x_r^j; \theta(t))}{d\theta}\Big)^T \cdot \frac{d\theta}{dt}
= -\sum_{i=1}^{N_b} \big(u(x_b^i; \theta) - g(x_b^i)\big) \Big\langle \frac{du(x_b^i; \theta(t))}{d\theta}, \frac{dN[u](x_r^j; \theta(t))}{d\theta} \Big\rangle - \sum_{i=1}^{N_r} \big(N[u](x_r^i; \theta) - f(x_r^i)\big) \Big\langle \frac{dN[u](x_r^i; \theta(t))}{d\theta}, \frac{dN[u](x_r^j; \theta(t))}{d\theta} \Big\rangle.

Then we can rewrite the above equations as


\frac{du(x_b; \theta(t))}{dt} = -K_{uu}(t) \cdot \big(u(x_b; \theta(t)) - g(x_b)\big) - K_{ur}(t) \cdot \big(N[u](x_r; \theta(t)) - f(x_r)\big),    (A.1)
\frac{dN[u](x_r; \theta(t))}{dt} = -K_{ru}(t) \cdot \big(u(x_b; \theta(t)) - g(x_b)\big) - K_{rr}(t) \cdot \big(N[u](x_r; \theta(t)) - f(x_r)\big),    (A.2)

where K_{ru}(t) = K_{ur}^T(t) and the (i, j)-th entries are given by

(K_{uu})_{ij}(t) = \Big\langle \frac{du(x_b^i; \theta(t))}{d\theta}, \frac{du(x_b^j; \theta(t))}{d\theta} \Big\rangle,
(K_{ur})_{ij}(t) = \Big\langle \frac{du(x_b^i; \theta(t))}{d\theta}, \frac{dN[u](x_r^j; \theta(t))}{d\theta} \Big\rangle,
(K_{rr})_{ij}(t) = \Big\langle \frac{dN[u](x_r^i; \theta(t))}{d\theta}, \frac{dN[u](x_r^j; \theta(t))}{d\theta} \Big\rangle. □

Appendix B. Proof of Theorem 4.1

Proof. Recall equation (4.3) and that all weights and biases are initialized by independent standard Gaussian distributions. Then by the central limit theorem, we have

u_{xx}(x; \theta) = \frac{1}{\sqrt{N}} W^{(1)} \cdot \big(\ddot{\sigma}(W^{(0)} x + b^{(0)}) \odot W^{(0)} \odot W^{(0)}\big) = \frac{1}{\sqrt{N}} \sum_{k=1}^{N} W_k^{(1)} \ddot{\sigma}(W_k^{(0)} x + b_k^{(0)}) (W_k^{(0)})^2 \xrightarrow{D} Y(x),

as N \to \infty, where \xrightarrow{D} denotes convergence in distribution and Y(x) is a centered Gaussian random variable with variance \mathrm{Var}\big[W_k^{(1)} \ddot{\sigma}(W_k^{(0)} x + b_k^{(0)})(W_k^{(0)})^2\big].
Since \ddot{\sigma} is bounded, we may assume that |\ddot{\sigma}| \leq C. Then we have

\sup_N \mathbb{E}\big[|u_{xx}(x; \theta)|^2\big] = \sup_N \mathbb{E}\Big[\mathbb{E}\big[|u_{xx}(x; \theta)|^2 \,\big|\, W^{(0)}, b^{(0)}\big]\Big]
= \sup_N \mathbb{E}\Big[\frac{1}{N} \sum_{k=1}^{N} (W_k^{(0)})^4 \, \ddot{\sigma}(W_k^{(0)} x + b_k^{(0)})^2\Big]
\leq C^2 \, \mathbb{E}\big[(W_k^{(0)})^4\big] < \infty.

This implies that u_{xx}(x; \theta) is uniformly integrable with respect to N. Now, for any given points x, x', we have

\Sigma_{xx}^{(1)}(x, x') := \mathbb{E}\big[Y(x) Y(x')\big] = \lim_{N \to \infty} \mathbb{E}\big[u_{xx}(x; \theta) u_{xx}(x'; \theta)\big]
= \lim_{N \to \infty} \mathbb{E}\Big[\mathbb{E}\big[u_{xx}(x; \theta) u_{xx}(x'; \theta) \,\big|\, W^{(0)}, b^{(0)}\big]\Big]
= \lim_{N \to \infty} \mathbb{E}\Big[\frac{1}{N}\big(\ddot{\sigma}(W^{(0)} x + b^{(0)}) \odot W^{(0)} \odot W^{(0)}\big)^T \big(\ddot{\sigma}(W^{(0)} x' + b^{(0)}) \odot W^{(0)} \odot W^{(0)}\big)\Big]
= \lim_{N \to \infty} \mathbb{E}\Big[\frac{1}{N} \sum_{k=1}^{N} (W_k^{(0)})^4 \ddot{\sigma}(W_k^{(0)} x + b_k^{(0)}) \ddot{\sigma}(W_k^{(0)} x' + b_k^{(0)})\Big]
= \mathbb{E}_{u, v \sim \mathcal{N}(0,1)}\big[u^4 \ddot{\sigma}(ux + v) \ddot{\sigma}(ux' + v)\big].

This concludes the proof. □


Appendix C. Proof of Theorem 4.3

Proof. To warm up, we first compute K uu (0) and its infinite width limit, which is already covered in [33]. By the definition
of K uu (0), for any two given input x, x we have
 du (x; θ (0)) du (x , θ (0)) 
K uu (0) = , .
dθ dθ
Recall that
$$u(x;\theta) = \frac{1}{\sqrt{N}}\,W^{(1)}\cdot\sigma\big(W^{(0)}x+b^{(0)}\big) + b^{(1)} = \frac{1}{\sqrt{N}}\sum_{k=1}^N W_k^{(1)}\sigma\big(W_k^{(0)}x+b_k^{(0)}\big) + b^{(1)},$$
and $\theta = \big(W^{(0)}, W^{(1)}, b^{(0)}, b^{(1)}\big)$. Then we have
$$\frac{\partial u(x;\theta)}{\partial W_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x, \qquad
\frac{\partial u(x;\theta)}{\partial W_k^{(1)}} = \frac{1}{\sqrt{N}}\,\sigma\big(W_k^{(0)}x+b_k^{(0)}\big),$$
$$\frac{\partial u(x;\theta)}{\partial b_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big), \qquad
\frac{\partial u(x;\theta)}{\partial b^{(1)}} = 1.$$
Then by the law of large numbers we have
$$\begin{aligned}
\sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial W_k^{(0)}}\frac{\partial u(x';\theta)}{\partial W_k^{(0)}}
&= \frac{1}{N}\sum_{k=1}^N\Big(W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x\Big)\cdot\Big(W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)x'\Big) \\
&= \frac{1}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\dot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)(xx') \\
&\xrightarrow{P} \mathbb{E}\Big[\big(W_k^{(1)}\big)^2\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\dot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big](xx') \\
&= \mathbb{E}\big[\big(W_k^{(1)}\big)^2\big]\,\mathbb{E}\Big[\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\dot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big](xx') = \dot{\Sigma}^{(1)}(x,x')(xx'),
\end{aligned}$$
as $N\to\infty$, where $\dot{\Sigma}^{(1)}(x,x')$ is defined in equation (2.5).
Moreover,
$$\sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial W_k^{(1)}}\frac{\partial u(x';\theta)}{\partial W_k^{(1)}} = \frac{1}{N}\sum_{k=1}^N\sigma\big(W_k^{(0)}x+b_k^{(0)}\big)\sigma\big(W_k^{(0)}x'+b_k^{(0)}\big) \xrightarrow{P} \mathbb{E}\Big[\sigma\big(W_k^{(0)}x+b_k^{(0)}\big)\sigma\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big] = \Sigma^{(1)}(x,x'),$$
as $N\to\infty$, where $\Sigma^{(1)}(x,x')$ is defined in equation (2.4).


Also,
$$\sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial b_k^{(0)}}\frac{\partial u(x';\theta)}{\partial b_k^{(0)}} = \frac{1}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\dot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big) \xrightarrow{P} \mathbb{E}\Big[\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\dot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big] = \dot{\Sigma}^{(1)}(x,x').$$

Then plugging all these together we obtain
$$\begin{aligned}
K_{uu}(0) &= \left\langle\frac{du(x;\theta(0))}{d\theta}, \frac{du(x';\theta(0))}{d\theta}\right\rangle \\
&= \sum_{l=0}^{1}\sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial W_k^{(l)}}\frac{\partial u(x';\theta)}{\partial W_k^{(l)}} + \sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial b_k^{(0)}}\frac{\partial u(x';\theta)}{\partial b_k^{(0)}} + \frac{\partial u(x;\theta)}{\partial b^{(1)}}\frac{\partial u(x';\theta)}{\partial b^{(1)}} \\
&\xrightarrow{P} \dot{\Sigma}^{(1)}(x,x')(xx') + \Sigma^{(1)}(x,x') + \dot{\Sigma}^{(1)}(x,x') + 1,
\end{aligned}$$
as $N\to\infty$. This formula is also consistent with equation (2.8).
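This limiting expression is also easy to verify numerically: estimate $\Sigma^{(1)}$ and $\dot{\Sigma}^{(1)}$ by Monte Carlo over $(W,b)\sim\mathcal{N}(0,1)^2$, and compare the resulting value of $\dot{\Sigma}^{(1)}(x,x')xx' + \Sigma^{(1)}(x,x') + \dot{\Sigma}^{(1)}(x,x') + 1$ against the gradient inner product of a single wide network. A small sketch under our own naming conventions ($\sigma=\tanh$; widths and sample sizes are arbitrary placeholders):

```python
import jax, jax.numpy as jnp

sigma, sigma_d = jnp.tanh, jax.grad(jnp.tanh)

def limit_K_uu(x, xp, n_mc=200000, seed=0):
    # Sigma^(1)(x,x') = E[sigma(Wx+b) sigma(Wx'+b)]; Sigma_dot^(1) is the analogue with sigma'.
    kW, kb = jax.random.split(jax.random.PRNGKey(seed))
    W, b = jax.random.normal(kW, (n_mc,)), jax.random.normal(kb, (n_mc,))
    Sig = jnp.mean(sigma(W * x + b) * sigma(W * xp + b))
    Sig_dot = jnp.mean(jax.vmap(sigma_d)(W * x + b) * jax.vmap(sigma_d)(W * xp + b))
    # Limiting kernel value: Sigma_dot * x x' + Sigma + Sigma_dot + 1
    return Sig_dot * x * xp + Sig + Sig_dot + 1.0

def empirical_K_uu(x, xp, N=10000, seed=1):
    k1, k2, k3 = jax.random.split(jax.random.PRNGKey(seed), 3)
    W0, b0, W1 = (jax.random.normal(k, (N,)) for k in (k1, k2, k3))

    def u(theta, x_):
        W0, b0, W1, b1 = theta
        return jnp.dot(W1, sigma(W0 * x_ + b0)) / jnp.sqrt(N) + b1

    theta = (W0, b0, W1, jnp.zeros(()))
    g = lambda x_: jax.grad(u, argnums=0)(theta, x_)
    gx, gxp = g(x), g(xp)
    # <du(x)/dtheta, du(x')/dtheta> summed over all parameter groups
    return sum(jnp.sum(a * b) for a, b in zip(gx, gxp))

print(limit_K_uu(0.2, 0.5), empirical_K_uu(0.2, 0.5))   # should roughly agree for large N
```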


Next, we compute $K_{rr}(0)$. To this end, recall that
$$u_{xx}(x;\theta) = \frac{1}{\sqrt{N}}\,W^{(1)}\cdot\Big[\ddot{\sigma}\big(W^{(0)}x+b^{(0)}\big)\odot W^{(0)}\odot W^{(0)}\Big] = \frac{1}{\sqrt{N}}\sum_{k=1}^N W_k^{(1)}\big(W_k^{(0)}\big)^2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big).$$
It is then easy to compute that
$$\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}W_k^{(0)}\Big(W_k^{(0)}\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x + 2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\Big),$$
$$\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(1)}} = \frac{1}{\sqrt{N}}\,\big(W_k^{(0)}\big)^2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big), \qquad
\frac{\partial u_{xx}(x;\theta)}{\partial b_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\big(W_k^{(0)}\big)^2\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big),$$
where $\dddot{\sigma}$ denotes the third order derivative of $\sigma$. Then we have
$$\begin{aligned}
\sum_{k=1}^N\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(0)}}\frac{\partial u_{xx}(x';\theta)}{\partial W_k^{(0)}}
&= \frac{1}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\big(W_k^{(0)}\big)^2\Big(W_k^{(0)}\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x + 2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\Big) \\
&\qquad\qquad\qquad\cdot\Big(W_k^{(0)}\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)x' + 2\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big) \\
&= I_1 + I_2 + I_3 + I_4,
\end{aligned}$$
where
$$\begin{aligned}
I_1 &= \frac{1}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\big(W_k^{(0)}\big)^4\,\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)x', \\
I_2 &= \frac{2}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\big(W_k^{(0)}\big)^3\,\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big), \\
I_3 &= \frac{2}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\big(W_k^{(0)}\big)^3\,\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)x', \\
I_4 &= \frac{4}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\big(W_k^{(0)}\big)^2\,\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big).
\end{aligned}$$

By the law of large numbers, letting $N\to\infty$ gives
$$\begin{aligned}
I_1 &\xrightarrow{P} \mathbb{E}\Big[\big(W_k^{(0)}\big)^4\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big]\,xx' := J_1, \\
I_2 &\xrightarrow{P} 2\,\mathbb{E}\Big[\big(W_k^{(0)}\big)^3\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big]\,x := J_2, \\
I_3 &\xrightarrow{P} 2\,\mathbb{E}\Big[\big(W_k^{(0)}\big)^3\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big]\,x' := J_3, \\
I_4 &\xrightarrow{P} 4\,\mathbb{E}\Big[\big(W_k^{(0)}\big)^2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big] := J_4.
\end{aligned}$$
In conclusion we have
$$\sum_{k=1}^N\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(0)}}\frac{\partial u_{xx}(x';\theta)}{\partial W_k^{(0)}} \xrightarrow{P} J_1 + J_2 + J_3 + J_4 := A_{rr}.$$
Moreover,
$$\sum_{k=1}^N\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(1)}}\frac{\partial u_{xx}(x';\theta)}{\partial W_k^{(1)}} = \frac{1}{N}\sum_{k=1}^N\big(W_k^{(0)}\big)^4\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big) \xrightarrow{P} \mathbb{E}\Big[\big(W_k^{(0)}\big)^4\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big] := B_{rr},$$
and
$$\sum_{k=1}^N\frac{\partial u_{xx}(x;\theta)}{\partial b_k^{(0)}}\frac{\partial u_{xx}(x';\theta)}{\partial b_k^{(0)}} = \frac{1}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\big(W_k^{(0)}\big)^4\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big) \xrightarrow{P} \mathbb{E}\Big[\big(W_k^{(0)}\big)^4\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big] := C_{rr}.$$

Now, recall that
$$K_{rr}(0) = \left\langle\frac{du_{xx}(x;\theta(0))}{d\theta}, \frac{du_{xx}(x';\theta(0))}{d\theta}\right\rangle.$$
Thus we can conclude that, as $N\to\infty$,
$$K_{rr}(0) \xrightarrow{P} A_{rr} + B_{rr} + C_{rr}.$$

Finally, recall that $K_{ur}(x,x') = K_{ru}(x',x)$. So it suffices to compute $K_{ur}(x,x')$ and its limit. To this end, recall that
$$K_{ur}(x,x') = \left\langle\frac{du(x;\theta(0))}{d\theta}, \frac{du_{xx}(x';\theta(0))}{d\theta}\right\rangle.$$
Then letting $N\to\infty$ gives
$$\begin{aligned}
\sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial W_k^{(0)}}\frac{\partial u_{xx}(x';\theta)}{\partial W_k^{(0)}}
&= \frac{1}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2 W_k^{(0)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x\cdot\Big(W_k^{(0)}\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)x' + 2\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big) \\
&\xrightarrow{P} \mathbb{E}\Big[\big(W_k^{(0)}\big)^2\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big](xx') + 2\,\mathbb{E}\Big[W_k^{(0)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big]x := A_{ur},
\end{aligned}$$
and
$$\sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial W_k^{(1)}}\frac{\partial u_{xx}(x';\theta)}{\partial W_k^{(1)}} = \frac{1}{N}\sum_{k=1}^N\big(W_k^{(0)}\big)^2\sigma\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big) \xrightarrow{P} \mathbb{E}\Big[\big(W_k^{(0)}\big)^2\sigma\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\ddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big] := B_{ur},$$
and
$$\sum_{k=1}^N\frac{\partial u(x;\theta)}{\partial b_k^{(0)}}\frac{\partial u_{xx}(x';\theta)}{\partial b_k^{(0)}} = \frac{1}{N}\sum_{k=1}^N\big(W_k^{(1)}\big)^2\big(W_k^{(0)}\big)^2\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big) \xrightarrow{P} \mathbb{E}\Big[\big(W_k^{(0)}\big)^2\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\cdot\dddot{\sigma}\big(W_k^{(0)}x'+b_k^{(0)}\big)\Big] := C_{ur}.$$

As a result, we obtain
$$K_{ur}(x,x') = \left\langle\frac{du(x;\theta(0))}{d\theta}, \frac{du_{xx}(x';\theta(0))}{d\theta}\right\rangle \xrightarrow{P} A_{ur} + B_{ur} + C_{ur},$$
as $N\to\infty$. This concludes the proof. $\square$
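To see this concentration in practice, one can evaluate a single entry of $K_{rr}(0)$ over several random initializations at increasing widths and observe its spread shrink, in line with the convergence in probability established above. A rough, self-contained sketch (our own code; the activation, widths, inputs, and number of trials are arbitrary choices):

```python
import jax, jax.numpy as jnp

def krr_entry(key, x, xp, N):
    # One entry of K_rr(0): <d u_xx(x)/d theta, d u_xx(x')/d theta> at initialization.
    k1, k2, k3 = jax.random.split(key, 3)
    theta = (jax.random.normal(k1, (N,)), jax.random.normal(k2, (N,)), jax.random.normal(k3, (N,)))

    def u(theta, x_):
        W0, b0, W1 = theta          # the output bias is omitted; it does not affect u_xx
        return jnp.dot(W1, jnp.tanh(W0 * x_ + b0)) / jnp.sqrt(N)

    u_xx = lambda theta, x_: jax.grad(jax.grad(u, argnums=1), argnums=1)(theta, x_)
    gx = jax.grad(u_xx, argnums=0)(theta, x)
    gxp = jax.grad(u_xx, argnums=0)(theta, xp)
    return sum(jnp.sum(a * b) for a, b in zip(gx, gxp))

x, xp = 0.25, 0.75
for N in (100, 1000, 10000):
    keys = jax.random.split(jax.random.PRNGKey(0), 20)
    vals = jnp.array([krr_entry(k, x, xp, N) for k in keys])
    # The standard deviation across initializations should decay as the width N grows.
    print(N, float(vals.mean()), float(vals.std()))
```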


Appendix D. Proof of Theorem 4.4

Before we prove the main theorem, we need to prove a series of lemmas.

Lemma D.1. Under the setting of Theorem 4.4, for l = 0, 1, we have


$$\sup_{t\in[0,T]}\left\|\frac{\partial u}{\partial W^{(l)}}\right\|_\infty = O\Big(\frac{1}{\sqrt{N}}\Big), \qquad
\sup_{t\in[0,T]}\left\|\frac{\partial u_{xx}}{\partial W^{(l)}}\right\|_\infty = O\Big(\frac{1}{\sqrt{N}}\Big),$$
$$\sup_{t\in[0,T]}\left\|\frac{\partial u}{\partial b^{(0)}}\right\|_\infty = O\Big(\frac{1}{\sqrt{N}}\Big), \qquad
\sup_{t\in[0,T]}\left\|\frac{\partial u_{xx}}{\partial b^{(0)}}\right\|_\infty = O\Big(\frac{1}{\sqrt{N}}\Big).$$

Proof. For the given model problem, recall that
$$u(x;\theta) = \frac{1}{\sqrt{N}}\,W^{(1)}\sigma\big(W^{(0)}x+b^{(0)}\big) + b^{(1)},$$
and
$$\frac{\partial u(x;\theta)}{\partial W_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x, \qquad
\frac{\partial u(x;\theta)}{\partial W_k^{(1)}} = \frac{1}{\sqrt{N}}\,\sigma\big(W_k^{(0)}x+b_k^{(0)}\big), \qquad
\frac{\partial u(x;\theta)}{\partial b_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big).$$
Then, by assumptions (i), (ii), and the boundedness of the input domain, we have
$$\sup_{t\in[0,T]}\left\|\frac{\partial u}{\partial W^{(l)}}\right\|_\infty \le \frac{C}{\sqrt{N}}, \quad l = 0,1, \qquad
\sup_{t\in[0,T]}\left\|\frac{\partial u}{\partial b^{(0)}}\right\|_\infty \le \frac{C}{\sqrt{N}}.$$
Also,
$$u_{xx}(x;\theta) = \frac{1}{\sqrt{N}}\,W^{(1)}\cdot\Big[\ddot{\sigma}\big(W^{(0)}x+b^{(0)}\big)\odot W^{(0)}\odot W^{(0)}\Big] = \frac{1}{\sqrt{N}}\sum_{k=1}^N W_k^{(1)}\big(W_k^{(0)}\big)^2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big),$$

and
$$\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}W_k^{(0)}\Big(W_k^{(0)}\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x + 2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\Big), \qquad
\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(1)}} = \frac{1}{\sqrt{N}}\,\big(W_k^{(0)}\big)^2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big),$$
$$\frac{\partial u_{xx}(x;\theta)}{\partial b_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\big(W_k^{(0)}\big)^2\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big).$$
Again, using assumptions (i), (ii) gives
$$\sup_{t\in[0,T]}\left\|\frac{\partial u_{xx}}{\partial W^{(l)}}\right\|_\infty \le \frac{C^4}{\sqrt{N}}, \quad l = 0,1, \qquad
\sup_{t\in[0,T]}\left\|\frac{\partial u_{xx}}{\partial b^{(0)}}\right\|_\infty \le \frac{C^4}{\sqrt{N}}.$$
This completes the proof. $\square$
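As a quick sanity check of these rates at initialization (where the required boundedness holds trivially), one can evaluate the sup-norm of the parameter gradients of $u$ and $u_{xx}$ for a few widths and confirm the $1/\sqrt{N}$ decay. A small sketch under the same single-hidden-layer setup (our own code; $\sigma=\tanh$ and the widths are arbitrary):

```python
import jax, jax.numpy as jnp

def grad_norms(N, x=0.5, seed=0):
    k1, k2, k3 = jax.random.split(jax.random.PRNGKey(seed), 3)
    # b^(1) is omitted: its gradient is identically 1 and is not covered by the 1/sqrt(N) bounds.
    theta = {"W0": jax.random.normal(k1, (N,)), "b0": jax.random.normal(k2, (N,)),
             "W1": jax.random.normal(k3, (N,))}

    def u(theta, x_):
        return jnp.dot(theta["W1"], jnp.tanh(theta["W0"] * x_ + theta["b0"])) / jnp.sqrt(N)

    u_xx = lambda theta, x_: jax.grad(jax.grad(u, argnums=1), argnums=1)(theta, x_)
    gu = jax.grad(u, argnums=0)(theta, x)
    gr = jax.grad(u_xx, argnums=0)(theta, x)
    # Sup-norm over the entries of each gradient block; both should scale like 1/sqrt(N).
    return ({k: float(jnp.max(jnp.abs(v))) for k, v in gu.items()},
            {k: float(jnp.max(jnp.abs(v))) for k, v in gr.items()})

for N in (100, 1000, 10000):
    print(N, grad_norms(N))
```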


Lemma D.2. Under the setting of Theorem 4.4, we have
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(W^{(l)}(t)-W^{(l)}(0)\big)\right\|_2 = 0, \quad l = 0,1, \tag{D.1}$$
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(b^{(0)}(t)-b^{(0)}(0)\big)\right\|_2 = 0. \tag{D.2}$$

Proof. Recall that the loss function for the model problem (4.1) is given by
$$\mathcal{L}(\theta) = \mathcal{L}_b(\theta) + \mathcal{L}_r(\theta) = \frac{1}{2}\sum_{i=1}^{N_b}\big|u(x_b^i;\theta)-g(x_b^i)\big|^2 + \frac{1}{2}\sum_{i=1}^{N_r}\big|u_{xx}(x_r^i;\theta)-f(x_r^i)\big|^2.$$

Consider minimizing the loss function $\mathcal{L}(\theta)$ by gradient descent with an infinitesimally small learning rate:
$$\frac{d\theta}{dt} = -\nabla\mathcal{L}(\theta).$$
This implies that
$$\frac{dW^{(l)}}{dt} = -\frac{\partial\mathcal{L}(\theta)}{\partial W^{(l)}}, \quad l = 0,1, \qquad
\frac{db^{(0)}}{dt} = -\frac{\partial\mathcal{L}(\theta)}{\partial b^{(0)}}.$$
Then we have
$$\begin{aligned}
\left\|\frac{1}{\sqrt{N}}\big(W^{(l)}(t)-W^{(l)}(0)\big)\right\|_2
&= \left\|\frac{1}{\sqrt{N}}\int_0^t\frac{dW^{(l)}(\tau)}{d\tau}\,d\tau\right\|_2
= \left\|\frac{1}{\sqrt{N}}\int_0^t\frac{\partial\mathcal{L}(\theta(\tau))}{\partial W^{(l)}}\,d\tau\right\|_2 \\
&= \left\|\frac{1}{\sqrt{N}}\int_0^t\bigg[\sum_{i=1}^{N_b}\big(u(x_b^i;\theta(\tau))-g(x_b^i)\big)\frac{\partial u}{\partial W^{(l)}}(x_b^i;\theta(\tau)) + \sum_{i=1}^{N_r}\big(u_{xx}(x_r^i;\theta(\tau))-f(x_r^i)\big)\frac{\partial u_{xx}}{\partial W^{(l)}}(x_r^i;\theta(\tau))\bigg]d\tau\right\|_2 \\
&\le I_1^{(l)} + I_2^{(l)},
\end{aligned}$$

where
$$I_1^{(l)} = \left\|\frac{1}{\sqrt{N}}\int_0^t\sum_{i=1}^{N_b}\big(u(x_b^i;\theta(\tau))-g(x_b^i)\big)\frac{\partial u}{\partial W^{(l)}}(x_b^i;\theta(\tau))\,d\tau\right\|_2, \qquad
I_2^{(l)} = \left\|\frac{1}{\sqrt{N}}\int_0^t\sum_{i=1}^{N_r}\big(u_{xx}(x_r^i;\theta(\tau))-f(x_r^i)\big)\frac{\partial u_{xx}}{\partial W^{(l)}}(x_r^i;\theta(\tau))\,d\tau\right\|_2.$$
We first proceed to estimate $I_1^{(l)}$ as
$$\begin{aligned}
I_1^{(l)} &\le \frac{1}{\sqrt{N}}\int_0^t\left\|\sum_{i=1}^{N_b}\big(u(x_b^i;\theta(\tau))-g(x_b^i)\big)\frac{\partial u}{\partial W^{(l)}}(x_b^i;\theta(\tau))\right\|_2 d\tau \\
&= \frac{1}{\sqrt{N}}\int_0^t\sqrt{\sum_{k=1}^N\bigg(\sum_{i=1}^{N_b}\big(u(x_b^i;\theta(\tau))-g(x_b^i)\big)\frac{\partial u}{\partial W_k^{(l)}}(x_b^i;\theta(\tau))\bigg)^2}\,d\tau \\
&\le \frac{1}{\sqrt{N}}\int_0^T\left\|\frac{\partial u}{\partial W^{(l)}}(x_b;\theta(\tau))\right\|_\infty\sqrt{\sum_{k=1}^N\bigg(\sum_{i=1}^{N_b}\big|u(x_b^i;\theta(\tau))-g(x_b^i)\big|\bigg)^2}\,d\tau \\
&= \frac{1}{\sqrt{N}}\int_0^T\sqrt{N}\left\|\frac{\partial u}{\partial W^{(l)}}(x_b;\theta(\tau))\right\|_\infty\cdot\sum_{i=1}^{N_b}\big|u(x_b^i;\theta(\tau))-g(x_b^i)\big|\,d\tau.
\end{aligned}$$

Thus, by the assumptions and Lemma D.1, for $l = 0,1$ we have
$$\sup_{t\in[0,T]} I_1^{(l)} \le \int_0^T\left\|\frac{\partial u}{\partial W^{(l)}}(x_b;\theta(\tau))\right\|_\infty\cdot\sum_{i=1}^{N_b}\big|u(x_b^i;\theta(\tau))-g(x_b^i)\big|\,d\tau \le \frac{C}{\sqrt{N}} \longrightarrow 0, \quad\text{as } N\to\infty.$$
Similarly,
$$\begin{aligned}
\sup_{t\in[0,T]} I_2^{(l)} &\le \frac{1}{\sqrt{N}}\int_0^T\left\|\frac{\partial u_{xx}}{\partial W^{(l)}}(x_r;\theta(\tau))\right\|_\infty\sqrt{\sum_{k=1}^N\bigg(\sum_{i=1}^{N_r}\big|u_{xx}(x_r^i;\theta(\tau))-f(x_r^i)\big|\bigg)^2}\,d\tau \\
&= \frac{1}{\sqrt{N}}\int_0^T\sqrt{N}\left\|\frac{\partial u_{xx}}{\partial W^{(l)}}(x_r;\theta(\tau))\right\|_\infty\cdot\sum_{i=1}^{N_r}\big|u_{xx}(x_r^i;\theta(\tau))-f(x_r^i)\big|\,d\tau
\le \frac{C^4}{\sqrt{N}} \longrightarrow 0, \quad\text{as } N\to\infty.
\end{aligned}$$
Plugging these together, we obtain
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(W^{(l)}(t)-W^{(l)}(0)\big)\right\|_2 \le \lim_{N\to\infty}\sup_{t\in[0,T]}\Big(I_1^{(l)} + I_2^{(l)}\Big) = 0,$$
for $l = 0,1$. Similarly, applying the same strategy to $b^{(0)}$ we can show
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(b^{(0)}(t)-b^{(0)}(0)\big)\right\|_2 = 0.$$
This concludes the proof. $\square$
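This "lazy" behavior of the weights is easy to observe empirically: train the one-hidden-layer PINN with plain gradient descent on a manufactured Poisson problem $u_{xx}=f$ and monitor $\frac{1}{\sqrt{N}}\|W^{(0)}(t)-W^{(0)}(0)\|_2$ for a few widths. A schematic example (our own code; the learning rate, step count, and manufactured data are placeholder choices):

```python
import jax, jax.numpy as jnp

def make_model(N):
    def u(theta, x):
        W0, b0, W1, b1 = theta
        return jnp.dot(W1, jnp.tanh(W0 * x + b0)) / jnp.sqrt(N) + b1
    u_xx = lambda theta, x: jax.grad(jax.grad(u, argnums=1), argnums=1)(theta, x)
    return u, u_xx

def loss(theta, u, u_xx, xb, gb, xr, fr):
    # (1/2) sum |u(xb) - g(xb)|^2 + (1/2) sum |u_xx(xr) - f(xr)|^2, as in the model problem.
    Lb = 0.5 * jnp.sum((jax.vmap(lambda x: u(theta, x))(xb) - gb) ** 2)
    Lr = 0.5 * jnp.sum((jax.vmap(lambda x: u_xx(theta, x))(xr) - fr) ** 2)
    return Lb + Lr

# Toy data for u_xx = f on [0, 1] with exact solution sin(pi x).
xr = jnp.linspace(0.0, 1.0, 32); fr = -(jnp.pi ** 2) * jnp.sin(jnp.pi * xr)
xb = jnp.array([0.0, 1.0]);      gb = jnp.zeros(2)

for N in (100, 2000):
    k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
    theta0 = (jax.random.normal(k1, (N,)), jax.random.normal(k2, (N,)),
              jax.random.normal(k3, (N,)), jnp.zeros(()))
    u, u_xx = make_model(N)
    grad_fn = jax.jit(jax.grad(lambda th: loss(th, u, u_xx, xb, gb, xr, fr)))

    theta, lr = theta0, 1e-4
    for _ in range(500):                          # plain gradient descent
        g = grad_fn(theta)
        theta = tuple(p - lr * dp for p, dp in zip(theta, g))

    drift = jnp.linalg.norm(theta[0] - theta0[0]) / jnp.sqrt(N)
    print(N, float(drift))                        # relative drift of W^(0); shrinks as N grows
```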

Lemma D.3. Under the setting of Theorem 4.4, we have
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\Big(\sigma^{(k)}\big(W^{(0)}(t)x+b^{(0)}(t)\big)-\sigma^{(k)}\big(W^{(0)}(0)x+b^{(0)}(0)\big)\Big)\right\|_2 = 0, \tag{D.3}$$
for $k = 0,1,2,3$, where $\sigma^{(k)}$ denotes the $k$-th order derivative of $\sigma$.

Proof. By the mean-value theorem for vector-valued functions and Lemma D.2, there exists $\xi$ (componentwise between $W^{(0)}(t)x+b^{(0)}(t)$ and $W^{(0)}(0)x+b^{(0)}(0)$) such that
$$\begin{aligned}
\left\|\frac{1}{\sqrt{N}}\Big(\sigma^{(k)}\big(W^{(0)}(t)x+b^{(0)}(t)\big)-\sigma^{(k)}\big(W^{(0)}(0)x+b^{(0)}(0)\big)\Big)\right\|_2
&\le \big\|\sigma^{(k+1)}(\xi)\big\|_\infty\left\|\frac{1}{\sqrt{N}}\Big(\big(W^{(0)}(t)x+b^{(0)}(t)\big)-\big(W^{(0)}(0)x+b^{(0)}(0)\big)\Big)\right\|_2 \\
&\le C\left\|\frac{1}{\sqrt{N}}\big(W^{(0)}(t)-W^{(0)}(0)\big)\right\|_2 + C\left\|\frac{1}{\sqrt{N}}\big(b^{(0)}(t)-b^{(0)}(0)\big)\right\|_2 \longrightarrow 0,
\end{aligned}$$
as $N\to\infty$. Here we use the assumption that $\sigma^{(k)}$ is bounded for $k = 0,1,2,3,4$. This concludes the proof. $\square$


Lemma D.4. Under the setting of Theorem 4.4, we have
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{\partial u(x;\theta(t))}{\partial\theta}-\frac{\partial u(x;\theta(0))}{\partial\theta}\right\|_2 = 0, \tag{D.4}$$
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{\partial u_{xx}(x;\theta(t))}{\partial\theta}-\frac{\partial u_{xx}(x;\theta(0))}{\partial\theta}\right\|_2 = 0. \tag{D.5}$$

Proof. Recall that
$$\frac{\partial u(x;\theta)}{\partial W_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x, \qquad
\frac{\partial u(x;\theta)}{\partial W_k^{(1)}} = \frac{1}{\sqrt{N}}\,\sigma\big(W_k^{(0)}x+b_k^{(0)}\big),$$
$$\frac{\partial u(x;\theta)}{\partial b_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\dot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big), \qquad
\frac{\partial u(x;\theta)}{\partial b^{(1)}} = 1.$$
To simplify notation, let us define
$$A(t) = \big[W^{(1)}(t)\big]^T, \qquad B(t) = \dot{\sigma}\big(W^{(0)}(t)x+b^{(0)}(t)\big)x.$$
Then, by assumption (i) and Lemmas D.2 and D.3, we have
$$\begin{aligned}
\sup_{t\in[0,T]}\left\|\frac{\partial u(x;\theta(t))}{\partial W^{(0)}}-\frac{\partial u(x;\theta(0))}{\partial W^{(0)}}\right\|_2
&= \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(A(t)\odot B(t)-A(0)\odot B(0)\big)\right\|_2 \\
&\le \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(A(t)-A(0)\big)\odot B(t)\right\|_2 + \left\|\frac{1}{\sqrt{N}}A(0)\odot\big(B(t)-B(0)\big)\right\|_2 \\
&\le \sup_{t\in[0,T]}\|B(t)\|_\infty\left\|\frac{1}{\sqrt{N}}\big(A(t)-A(0)\big)\right\|_2 + \sup_{t\in[0,T]}\|A(0)\|_\infty\left\|\frac{1}{\sqrt{N}}\big(B(t)-B(0)\big)\right\|_2 \\
&\longrightarrow 0,
\end{aligned}$$
as $N\to\infty$. Here $\odot$ denotes point-wise multiplication.
Similarly, we can show that
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{\partial u(x;\theta(t))}{\partial W^{(1)}}-\frac{\partial u(x;\theta(0))}{\partial W^{(1)}}\right\|_2 = 0, \qquad
\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{\partial u(x;\theta(t))}{\partial b^{(0)}}-\frac{\partial u(x;\theta(0))}{\partial b^{(0)}}\right\|_2 = 0.$$
Thus, we conclude that
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{\partial u(x;\theta(t))}{\partial\theta}-\frac{\partial u(x;\theta(0))}{\partial\theta}\right\|_2 = 0.$$

Now for $u_{xx}$, we know that
$$\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}W_k^{(0)}\Big(W_k^{(0)}\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)x + 2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big)\Big), \qquad
\frac{\partial u_{xx}(x;\theta)}{\partial W_k^{(1)}} = \frac{1}{\sqrt{N}}\,\big(W_k^{(0)}\big)^2\ddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big),$$
$$\frac{\partial u_{xx}(x;\theta)}{\partial b_k^{(0)}} = \frac{1}{\sqrt{N}}\,W_k^{(1)}\big(W_k^{(0)}\big)^2\dddot{\sigma}\big(W_k^{(0)}x+b_k^{(0)}\big).$$


Then for $W^{(0)}$, again we define
$$A(t) = \big[W^{(1)}(t)\big]^T, \quad B(t) = W^{(0)}(t), \quad C(t) = \dddot{\sigma}\big(W^{(0)}(t)x+b^{(0)}(t)\big)x, \quad D(t) = 2\ddot{\sigma}\big(W^{(0)}(t)x+b^{(0)}(t)\big).$$

Then,
$$\begin{aligned}
\sup_{t\in[0,T]}\left\|\frac{\partial u_{xx}(x;\theta(t))}{\partial W^{(0)}}-\frac{\partial u_{xx}(x;\theta(0))}{\partial W^{(0)}}\right\|_2
&= \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\Big(A(t)\odot B(t)\odot\big[B(t)\odot C(t)+D(t)\big]-A(0)\odot B(0)\odot\big[B(0)\odot C(0)+D(0)\big]\Big)\right\|_2 \\
&\le \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\Big(A(t)\odot B(t)\odot B(t)\odot C(t)-A(0)\odot B(0)\odot B(0)\odot C(0)\Big)\right\|_2 \\
&\quad + \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\Big(A(t)\odot B(t)\odot D(t)-A(0)\odot B(0)\odot D(0)\Big)\right\|_2 \\
&:= I_1 + I_2.
\end{aligned}$$
For $I_1$, we have
$$\begin{aligned}
\sup_{t\in[0,T]}&\left\|\frac{1}{\sqrt{N}}\Big(A(t)\odot B(t)\odot B(t)\odot C(t)-A(0)\odot B(0)\odot B(0)\odot C(0)\Big)\right\|_2 \\
&\le \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(A(t)-A(0)\big)\odot B(t)\odot B(t)\odot C(t)\right\|_2 + \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}A(0)\odot\Big(B(t)\odot B(t)\odot C(t)-B(0)\odot B(0)\odot C(0)\Big)\right\|_2 \\
&\le \sup_{t\in[0,T]}\|B(t)\|_\infty^2\|C(t)\|_\infty\left\|\frac{1}{\sqrt{N}}\big(A(t)-A(0)\big)\right\|_2 + \sup_{t\in[0,T]}\|A(0)\|_\infty\left\|\frac{1}{\sqrt{N}}\Big(B(t)\odot B(t)\odot C(t)-B(0)\odot B(0)\odot C(0)\Big)\right\|_2 \\
&\lesssim \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(A(t)-A(0)\big)\right\|_2 + \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\Big(B(t)\odot B(t)\odot C(t)-B(0)\odot B(0)\odot C(0)\Big)\right\|_2 \\
&\;\;\vdots \\
&\lesssim \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(A(t)-A(0)\big)\right\|_2 + \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(B(t)-B(0)\big)\right\|_2 + \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(C(t)-C(0)\big)\right\|_2 + \sup_{t\in[0,T]}\left\|\frac{1}{\sqrt{N}}\big(D(t)-D(0)\big)\right\|_2 \\
&\longrightarrow 0,
\end{aligned}$$
as $N\to\infty$. We can apply the same strategy to $I_2$, as well as to $W^{(1)}$ and $b^{(0)}$. As a consequence, we conclude
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\left\|\frac{\partial u_{xx}(x;\theta(t))}{\partial\theta}-\frac{\partial u_{xx}(x;\theta(0))}{\partial\theta}\right\|_2 = 0.$$
This concludes the proof. $\square$

With these lemmas in hand, we can now prove our main Theorem 4.4.

Proof of Theorem 4.4. For a given data set $\{x_b^i, g(x_b^i)\}_{i=1}^{N_b}$, $\{x_r^i, f(x_r^i)\}_{i=1}^{N_r}$, let $J_u(t)$ and $J_r(t)$ be the Jacobian matrices of $u(x_b;\theta(t))$ and $u_{xx}(x_r;\theta(t))$ with respect to $\theta$, respectively,


$$J_u(t) = \left[\frac{\partial u(x_b^i;\theta(t))}{\partial\theta_j}\right], \qquad
J_r(t) = \left[\frac{\partial u_{xx}(x_r^i;\theta(t))}{\partial\theta_j}\right].$$
Note that
$$K(t) = \begin{bmatrix} J_u(t) \\ J_r(t)\end{bmatrix}\big[J_u^T(t),\; J_r^T(t)\big] := J(t)J^T(t).$$

This implies that
$$\begin{aligned}
\|K(t)-K(0)\|_2 &= \big\|J(t)J^T(t)-J(0)J^T(0)\big\|_2 \\
&\le \big\|J(t)\big(J(t)-J(0)\big)^T\big\|_2 + \big\|\big(J(t)-J(0)\big)J^T(0)\big\|_2 \\
&\le \|J(t)\|_2\,\|J(t)-J(0)\|_2 + \|J(t)-J(0)\|_2\,\|J(0)\|_2.
\end{aligned}$$
By Lemma D.1, it is easy to show that $\|J(t)\|_2$ is bounded. So it now suffices to show that
$$\sup_{t\in[0,T]}\big\|J_u(t)-J_u(0)\big\|_F \to 0, \tag{D.6}$$
$$\sup_{t\in[0,T]}\big\|J_r(t)-J_r(0)\big\|_F \to 0, \tag{D.7}$$
as $N\to\infty$. Since the training data set is finite, it suffices to consider just two inputs $x, x'$. By the Cauchy-Schwarz inequality, we obtain
$$\begin{aligned}
\bigg|\Big\langle\frac{\partial u(x;\theta(t))}{\partial\theta},\frac{\partial u(x';\theta(t))}{\partial\theta}\Big\rangle-\Big\langle\frac{\partial u(x;\theta(0))}{\partial\theta},\frac{\partial u(x';\theta(0))}{\partial\theta}\Big\rangle\bigg|
&\le \bigg|\Big\langle\frac{\partial u(x;\theta(t))}{\partial\theta},\frac{\partial u(x';\theta(t))}{\partial\theta}-\frac{\partial u(x';\theta(0))}{\partial\theta}\Big\rangle\bigg| + \bigg|\Big\langle\frac{\partial u(x;\theta(t))}{\partial\theta}-\frac{\partial u(x;\theta(0))}{\partial\theta},\frac{\partial u(x';\theta(0))}{\partial\theta}\Big\rangle\bigg| \\
&\le \Big\|\frac{\partial u(x;\theta(t))}{\partial\theta}\Big\|_2\Big\|\frac{\partial u(x';\theta(t))}{\partial\theta}-\frac{\partial u(x';\theta(0))}{\partial\theta}\Big\|_2 + \Big\|\frac{\partial u(x;\theta(t))}{\partial\theta}-\frac{\partial u(x;\theta(0))}{\partial\theta}\Big\|_2\Big\|\frac{\partial u(x';\theta(0))}{\partial\theta}\Big\|_2.
\end{aligned}$$
From Lemma D.4, we have that $\big\|\frac{\partial u(x;\theta(t))}{\partial\theta}\big\|_2$ is uniformly bounded for $t\in[0,T]$. Then using Lemma D.4 again gives
$$\sup_{t\in[0,T]}\bigg|\Big\langle\frac{\partial u(x;\theta(t))}{\partial\theta},\frac{\partial u(x';\theta(t))}{\partial\theta}\Big\rangle-\Big\langle\frac{\partial u(x;\theta(0))}{\partial\theta},\frac{\partial u(x';\theta(0))}{\partial\theta}\Big\rangle\bigg|
\le C\sup_{t\in[0,T]}\Big\|\frac{\partial u(x';\theta(t))}{\partial\theta}-\frac{\partial u(x';\theta(0))}{\partial\theta}\Big\|_2 + C\sup_{t\in[0,T]}\Big\|\frac{\partial u(x;\theta(t))}{\partial\theta}-\frac{\partial u(x;\theta(0))}{\partial\theta}\Big\|_2 \longrightarrow 0,$$
as $N\to\infty$. This implies that
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\big\|J_u(t)-J_u(0)\big\|_2 = 0.$$

Similarly, we can repeat this calculation for $J_r$, i.e.,
$$\begin{aligned}
\sup_{t\in[0,T]}&\bigg|\Big\langle\frac{\partial u_{xx}(x;\theta(t))}{\partial\theta},\frac{\partial u_{xx}(x';\theta(t))}{\partial\theta}\Big\rangle-\Big\langle\frac{\partial u_{xx}(x;\theta(0))}{\partial\theta},\frac{\partial u_{xx}(x';\theta(0))}{\partial\theta}\Big\rangle\bigg| \\
&\le \sup_{t\in[0,T]}\Big\|\frac{\partial u_{xx}(x;\theta(t))}{\partial\theta}\Big\|_2\Big\|\frac{\partial u_{xx}(x';\theta(t))}{\partial\theta}-\frac{\partial u_{xx}(x';\theta(0))}{\partial\theta}\Big\|_2 + \sup_{t\in[0,T]}\Big\|\frac{\partial u_{xx}(x;\theta(t))}{\partial\theta}-\frac{\partial u_{xx}(x;\theta(0))}{\partial\theta}\Big\|_2\Big\|\frac{\partial u_{xx}(x';\theta(0))}{\partial\theta}\Big\|_2 \\
&\le C\sup_{t\in[0,T]}\Big\|\frac{\partial u_{xx}(x';\theta(t))}{\partial\theta}-\frac{\partial u_{xx}(x';\theta(0))}{\partial\theta}\Big\|_2 + C\sup_{t\in[0,T]}\Big\|\frac{\partial u_{xx}(x;\theta(t))}{\partial\theta}-\frac{\partial u_{xx}(x;\theta(0))}{\partial\theta}\Big\|_2 \longrightarrow 0,
\end{aligned}$$
as $N\to\infty$. Hence, we get
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\big\|J_r(t)-J_r(0)\big\|_2 = 0,$$

and thus we conclude that
$$\lim_{N\to\infty}\sup_{t\in[0,T]}\|K(t)-K(0)\|_2 = 0.$$
This concludes the proof. $\square$
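As a final illustration, the statement of Theorem 4.4 can be checked empirically by assembling $K(t) = J(t)J(t)^T$ during a short run of gradient descent and monitoring its relative change. A compact sketch along the lines of the previous examples (again our own code, with placeholder data and hyperparameters):

```python
import jax, jax.numpy as jnp
from jax.flatten_util import ravel_pytree

N = 2000
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
theta0 = {"W0": jax.random.normal(k1, (N,)), "b0": jax.random.normal(k2, (N,)),
          "W1": jax.random.normal(k3, (N,)), "b1": jnp.zeros(())}

def u(theta, x):
    return jnp.dot(theta["W1"], jnp.tanh(theta["W0"] * x + theta["b0"])) / jnp.sqrt(N) + theta["b1"]

u_xx = lambda theta, x: jax.grad(jax.grad(u, argnums=1), argnums=1)(theta, x)

xb, gb = jnp.array([0.0, 1.0]), jnp.zeros(2)
xr = jnp.linspace(0.0, 1.0, 16); fr = -(jnp.pi ** 2) * jnp.sin(jnp.pi * xr)

def full_jacobian(theta):
    flat, unravel = ravel_pytree(theta)
    rows_u = jax.vmap(lambda x: jax.grad(lambda p, x_: u(unravel(p), x_))(flat, x))(xb)
    rows_r = jax.vmap(lambda x: jax.grad(lambda p, x_: u_xx(unravel(p), x_))(flat, x))(xr)
    return jnp.vstack([rows_u, rows_r])           # J(t); K(t) = J(t) J(t)^T

def loss(theta):
    Lb = 0.5 * jnp.sum((jax.vmap(lambda x: u(theta, x))(xb) - gb) ** 2)
    Lr = 0.5 * jnp.sum((jax.vmap(lambda x: u_xx(theta, x))(xr) - fr) ** 2)
    return Lb + Lr

J0 = full_jacobian(theta0); K0 = J0 @ J0.T
grad_loss = jax.jit(jax.grad(loss))
theta, lr = theta0, 1e-4
for step in range(1, 501):
    theta = jax.tree_util.tree_map(lambda p, dp: p - lr * dp, theta, grad_loss(theta))
    if step % 100 == 0:
        J = full_jacobian(theta); K = J @ J.T
        # Relative change of the NTK; it stays small for large N, consistent with Theorem 4.4.
        print(step, float(jnp.linalg.norm(K - K0, 2) / jnp.linalg.norm(K0, 2)))
```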

References

[1] Maziar Raissi, Alireza Yazdani, George Em Karniadakis, Hidden fluid mechanics: learning velocity and pressure fields from flow visualizations, Science
367 (6481) (2020) 1026–1030.
[2] Luning Sun, Han Gao, Shaowu Pan, Jian-Xun Wang, Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation
data, Comput. Methods Appl. Mech. Eng. 361 (2020) 112732.
[3] Maziar Raissi, Hessam Babaee, Peyman Givi, Deep learning of turbulent scalar mixing, Phys. Rev. Fluids 4 (12) (2019) 124501.
[4] Maziar Raissi, Zhicheng Wang, Michael S. Triantafyllou, George Em Karniadakis, Deep learning of vortex-induced vibrations, J. Fluid Mech. 861 (2019)
119–137.
[5] Xiaowei Jin, Shengze Cai, Hui Li, George Em Karniadakis, NSFnets (Navier-Stokes flow nets): physics-informed neural networks for the incompressible
Navier-Stokes equations, arXiv preprint arXiv:2003.06496, 2020.
[6] Francisco Sahli Costabal, Yibo Yang, Paris Perdikaris, Daniel E. Hurtado, Ellen Kuhl, Physics-informed neural networks for cardiac activation mapping,
Front. Phys. 8 (2020) 42.
[7] Georgios Kissas, Yibo Yang, Eileen Hwuang, Walter R. Witschey, John A. Detre, Paris Perdikaris, Machine learning in cardiovascular flows modeling:
predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks, Comput. Methods Appl. Mech. Eng.
358 (2020) 112623.
[8] Zhiwei Fang, Justin Zhan, Deep physical informed neural networks for metamaterial design, IEEE Access 8 (2019) 24506–24513.
[9] Dehao Liu, Yan Wang, Multi-fidelity physics-constrained neural network and its application in materials modeling, J. Mech. Des. 141 (12) (2019).
[10] Yuyao Chen, Lu Lu, George Em Karniadakis, Luca Dal Negro, Physics-informed neural networks for inverse problems in nano-optics and metamaterials,
Opt. Express 28 (8) (2020) 11618–11633.
[11] Sifan Wang, Paris Perdikaris, Deep learning of free boundary and Stefan problems, arXiv preprint, arXiv:2006.05311, 2020.
[12] Yibo Yang, Paris Perdikaris, Adversarial uncertainty quantification in physics-informed neural networks, J. Comput. Phys. 394 (2019) 136–152.
[13] Yinhao Zhu, Nicholas Zabaras, Phaedon-Stelios Koutsourelakis, Paris Perdikaris, Physics-constrained deep learning for high-dimensional surrogate mod-
eling and uncertainty quantification without labeled data, J. Comput. Phys. 394 (2019) 56–81.
[14] Yibo Yang, Paris Perdikaris, Physics-informed deep generative models, arXiv preprint arXiv:1812.03511, 2018.
[15] Luning Sun, Jian-Xun Wang, Physics-constrained Bayesian neural network for fluid flow reconstruction with sparse and noisy data, arXiv preprint
arXiv:2001.05542, 2020.
[16] Liu Yang, Xuhui Meng, George Em Karniadakis, B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy
data, arXiv preprint arXiv:2003.06097, 2020.
[17] Justin Sirignano, Konstantinos Spiliopoulos, DGM: a deep learning algorithm for solving partial differential equations, J. Comput. Phys. 375 (2018)
1339–1364.
[18] Jiequn Han, Arnulf Jentzen, E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proc. Natl. Acad. Sci. USA 115 (34)
(2018) 8505–8510.
[19] Dongkun Zhang, Ling Guo, George Em Karniadakis, Learning in modal space: solving time-dependent stochastic PDEs using physics-informed neural
networks, SIAM J. Sci. Comput. 42 (2) (2020) A639–A665.
[20] Guofei Pang, Lu Lu, George Em Karniadakis, fPINNs: fractional physics-informed neural networks, SIAM J. Sci. Comput. 41 (4) (2019) A2603–A2626.
[21] Guofei Pang, Marta D’Elia, Michael Parks, George E. Karniadakis, nPINNs: nonlocal physics-informed neural networks for a parametrized nonlocal
universal Laplacian operator. Algorithms and applications, arXiv preprint arXiv:2004.04276, 2020.
[22] Alexandre M. Tartakovsky, Carlos Ortiz Marrero, Paris Perdikaris, Guzel D. Tartakovsky, David Barajas-Solano, Learning parameters and constitutive
relationships with physics informed deep neural networks, arXiv preprint arXiv:1808.03398, 2018.
[23] Lu Lu, Pengzhan Jin, George Em Karniadakis, DeepONet: learning nonlinear operators for identifying differential equations based on the universal
approximation theorem of operators, arXiv preprint, arXiv:1910.03193, 2019.
[24] A.M. Tartakovsky, C. Ortiz Marrero, Paris Perdikaris, G.D. Tartakovsky, D. Barajas-Solano, Physics-informed deep neural networks for learning parameters
and constitutive relationships in subsurface flow problems, Water Resour. Res. 56 (5) (2020) e2019WR026731.
[25] Yeonjong Shin, Jerome Darbon, George Em Karniadakis, On the convergence and generalization of physics informed neural networks, arXiv preprint
arXiv:2004.01806, 2020.
[26] Hamdi A. Tchelepi, Olga Fuks, Limitations of physics informed machine learning for nonlinear two-phase transport in porous media, J. Mach. Learn.
Model. Comput. 1 (1) (2020).
[27] Maziar Raissi, Deep hidden physics models: deep learning of nonlinear partial differential equations, J. Mach. Learn. Res. 19 (1) (2018) 932–955.
[28] Sifan Wang, Yujun Teng, Paris Perdikaris, Understanding and mitigating gradient pathologies in physics-informed neural networks, arXiv preprint
arXiv:2001.04536, 2020.
[29] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, Aaron Courville, On the spectral bias of neural
networks, in: International Conference on Machine Learning, 2019, pp. 5301–5310.
[30] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, Quanquan Gu, Towards understanding the spectral bias of deep learning, arXiv preprint arXiv:
1912.01198, 2019.
[31] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, Ren
Ng, Fourier features let networks learn high frequency functions in low dimensional domains, arXiv preprint arXiv:2006.10739, 2020.
[32] Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, Shira Kritchman, Frequency bias in neural networks for input of non-uniform
density, arXiv preprint arXiv:2003.04560, 2020.
[33] Arthur Jacot, Franck Gabriel, Clément Hongler, Neural tangent kernel: convergence and generalization in neural networks, in: Advances in Neural
Information Processing Systems, 2018, pp. 8571–8580.
[34] Greg Yang, Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel
derivation, arXiv preprint arXiv:1902.04760, 2019.


[35] Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, Zoubin Ghahramani, Gaussian process behaviour in wide deep neural networks, arXiv preprint arXiv:1804.11271, 2018.
[36] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein, Deep neural networks as Gaussian pro-
cesses, arXiv preprint arXiv:1711.00165, 2017.
[37] David J.C. MacKay, Introduction to Gaussian Processes, NATO ASI Series F Computer and Systems Sciences, vol. 168, 1998, pp. 133–166.
[38] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, Ruosong Wang, On exact computation with an infinitely wide neural net, in:
Advances in Neural Information Processing Systems, 2019, pp. 8141–8150.
[39] Maziar Raissi, Paris Perdikaris, George E. Karniadakis, Physics-informed neural networks: a deep learning framework for solving forward and inverse
problems involving nonlinear partial differential equations, J. Comput. Phys. 378 (2019) 686–707.
[40] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington, Wide neural networks of any
depth evolve as linear models under gradient descent, in: Advances in Neural Information Processing Systems, 2019, pp. 8572–8583.
[41] Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, Zheng Ma, Frequency principle: Fourier analysis sheds light on deep neural networks, arXiv
preprint arXiv:1901.06523, 2019.
[42] Basri Ronen, David Jacobs, Yoni Kasten, Shira Kritchman, The convergence rate of neural networks for learned functions of different frequencies, in:
Advances in Neural Information Processing Systems, 2019, pp. 4761–4771.
[43] Lu Lu, Xuhui Meng, Zhiping Mao, George E. Karniadakis, DeepXDE: a deep learning library for solving differential equations, arXiv preprint arXiv:
1907.04502, 2019.
[44] Parviz Moin, Fundamentals of Engineering Numerical Analysis, Cambridge University Press, 2010.
[45] L.C. Evans, Partial Differential Equations, Graduate Studies in Mathematics, American Mathematical Society, 1998.
[46] Xavier Glorot, Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[47] Diederik P. Kingma, Jimmy Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[48] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, Andrew Rabinovich, GradNorm: gradient normalization for adaptive loss balancing in deep multitask
networks, in: International Conference on Machine Learning, PMLR, 2018, pp. 794–803.
[49] Haowen Xu, Hao Zhang, Zhiting Hu, Xiaodan Liang, Ruslan Salakhutdinov, Eric Xing, AutoLoss: learning discrete schedules for alternate optimization,
arXiv preprint arXiv:1810.02442, 2018.
[50] A. Ali Heydari, Craig A. Thompson, Asif Mehmood, SoftAdapt: techniques for adaptive loss weighting of neural networks with multi-part loss functions,
arXiv preprint arXiv:1912.12355, 2019.
