1 Introduction

Partial Differential Equations (PDEs) describe continuous phenomena, such as fluid flows or the propagation of waves. They correspond to the mathematical translation of observable physical, chemical or biological processes. Unfortunately, these equations rarely admit analytical solutions and need to be discretised on some mesh. This process can be computationally expensive, especially for high-dimensional problems and when unstructured meshes are required, for example to account for local irregular behaviours. The discretised scheme can then be solved using a variety of numerical methods, such as finite elements (FEM), finite differences (FDM) or finite volumes (FVM). However, even these methods can be inefficient for large and complex problems. For example, solving the Navier–Stokes equations, which describe the motion of a fluid, can require millions of hours of CPU or GPU time on a supercomputer. Another example is the Poisson equation, one of the most important PDEs in engineering, with applications including heat conduction, gravitation and electrodynamics. Solving it numerically in high dimension is only tractable with iterative methods, which often do not scale well with dimension and/or require specialist knowledge when dealing with boundary conditions or when generating the discretisation mesh.

Neural networks (NNs) are well positioned to solve such complicated PDEs and are already used in various areas of engineering and applied mathematics for complex regression and image-to-image translation tasks. The scientific computing community has applied them to PDE solving as early as the 1980s [20], yet interest has exploded in recent years, due in part to significant improvements in computational techniques and in the formulation of such networks, as detailed and highlighted for example in [4, 21, 32].

Quantum computing is a transformative new paradigm which takes advantage of the quantum phenomena seen at the microscopic physical scale. While significantly more challenging to design, quantum computers can run specialised algorithms that scale better than their classical counterparts, sometimes exponentially so. Quantum computers are made of quantum bits (or qubits) which, unlike the bits in conventional digital computers, store and process data according to two key principles of quantum physics: superposition and entanglement. They characteristically suffer from quantum errors, related to the quantum nature of their qubits. Even though quantum computers of sufficient scale are not yet available, there is a clear need to understand which tasks we can hope to perform on them and to design methods to mitigate the effects of quantum errors [29].

Quantum neural networks form a new class of machine learning models and leverage quantum mechanical principles such as superposition and entanglement, with the potential to deal with complicated problems and/or high-dimensional spaces. Architectures for quantum neural networks have been suggested in [7, 11, 34], pointing to potential advantages, including faster training. Preliminary theoretical research into quantum machine learning shows that quantum networks can produce more trainable models [1]. This is particularly relevant to solving PDEs with machine learning, as techniques which produce a more favourable loss landscape can drastically improve the performance of these models [13, 18].

In the present work, we suggest a new way of formulating a quantum neural network, translate some classical machine learning techniques to the quantum setting and develop a complexity analysis in the context of specific PDEs (the Heat, the Poisson and an HJB equation). This provides a framework to demonstrate the potential and the versatility of quantum neural networks as PDE solvers.

The paper is organised as follows: Sect. 2 introduces the PINN algorithm and reviews the basics of classical and quantum networks. In Sect. 3, we introduce a novel quantum network to solve specific PDEs and provide a complexity analysis. Finally, we confirm the quality of the scheme numerically in Sect. 4.

Notations. We denote by \({\mathcal {C}}^{n}(\Omega ,{\mathbb {R}})\) the space of n times differentiable functions from \(\Omega \) to \({\mathbb {R}}\), by \(L^p(\Omega )\) the space of functions with finite \(L^p\) norm and define the Sobolev space \({\mathcal {W}}^{\alpha ,\beta }(\Omega ): = \left\{ f \in L^\beta (\Omega ): \nabla ^s f \in L^\beta (\Omega ) \text { for all }|s| \le \alpha \right\} \), where \(\nabla ^s f\) refers to the weak derivative of order s. Similarly, we let \({\mathcal {W}}_0^{\alpha ,\beta }\) be the subspace of functions in \({\mathcal {W}}^{\alpha ,\beta }(\Omega )\) that vanish in the trace sense on the boundary of \(\Omega \). We use \(\Vert \cdot \Vert \) to refer to the Euclidean norm.

2 Main tools

2.1 PINN algorithm

The Physics-informed neural network (PINN) algorithm relies on the expressive power of neural networks to solve PDEs. Let \(\Omega \subset {\mathbb {R}}^{d}\) be a bounded Lipschitz domain with boundary \(\partial \Omega \), \({\mathscr {F}}: {\mathcal {C}}^{K}(\Omega ,{\mathbb {R}}) \rightarrow {\mathcal {C}}(\Omega ,{\mathbb {R}})\) a differential operator of order at most K, \(E \subset \partial \Omega \) and \(h: E \rightarrow {\mathbb {R}}\). The goal is to estimate the solution \(u: \Omega \rightarrow {\mathbb {R}}\) to the PDE

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathscr {F}}\left( u\right) (x) = 0, & \text {for all } x\in \Omega , \\ u(x)=h(x), & \text {for all } x\in E. \end{array}\right. } \end{aligned}$$

Let \(u_\Theta : \Omega \rightarrow {\mathbb {R}}\) be a neural network at least K times continuously differentiable parameterised by some set \(\Theta \). We assume access to two datasets: independent and identically distributed (iid) random vectors \(\{X^{(e)}_i\}_{i=1,\ldots ,n_e}\) with known distribution \(\mu _E\) on E and iid random vectors \(\{X^{(r)}_i\}_{i=1,\ldots , n_r}\) with known distribution \(\mu _\Omega \) on \(\Omega \). We then minimise the empirical loss function

$$\begin{aligned} {\mathcal {L}}_{n_e,n_r}(u_\Theta ):= \frac{\lambda _e}{n_e}\sum _{i=1}^{n_e}\left| u_\Theta \left( X^{(e)}_i\right) - h\left( X^{(e)}_i\right) \right| + \frac{1}{n_r} \sum _{i = 1}^{n_r} \left| {\mathscr {F}}\left( u_\Theta \right) \left( X^{(r)}_i\right) \right| ,\qquad \quad \end{aligned}$$
(2.1)

over the set \(\Theta \), where \(\lambda _e >0\) is a hyper-parameter balancing the two loss components, using a hybrid quantum-classical training loop. This training loop evaluates all \(u_\Theta (\cdot )\) values on a quantum computer before feeding them to a classical computer for use in classical optimisation routines. As shown in [5] (Proposition 3.2 and the associated discussion), minimising (2.1) does not necessarily imply anything about the mean squared error \(\Vert u - u_\Theta \Vert ^2_2\).
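To make the training objective concrete, the following Python sketch assembles the empirical loss (2.1) for generic placeholders: `u_theta`, `F_res` and `h` are hypothetical stand-ins for the (quantum) trial solution, the PDE residual and the boundary data, and the derivative evaluation inside the residual (parameter shift or smoothing, Sects. 2.2 and 2.4.3) is abstracted away. It is a minimal illustration, not the implementation used in the experiments.

```python
import numpy as np

def pinn_loss(u_theta, F_res, h, X_e, X_r, lambda_e=1.0):
    """Empirical PINN loss (2.1): boundary mismatch plus PDE residual."""
    boundary = np.mean([abs(u_theta(x) - h(x)) for x in X_e])
    residual = np.mean([abs(F_res(x)) for x in X_r])
    return lambda_e * boundary + residual

# Toy check: u(x) = ||x||^2 / (2d) satisfies Laplace(u) = 1 in R^d and equals
# 1/(2d) on the unit sphere, so both loss terms vanish for this exact solution.
d = 3
u = lambda x: x @ x / (2 * d)
h = lambda x: 1.0 / (2 * d)          # boundary data on the unit sphere
F_res = lambda x: 0.0                # Laplace(u)(x) - 1 = 0 for this u
rng = np.random.default_rng(0)
X_e = rng.standard_normal((16, d))
X_e /= np.linalg.norm(X_e, axis=1, keepdims=True)   # samples on the unit sphere
X_r = rng.standard_normal((64, d))
print(pinn_loss(u, F_res, h, X_e, X_r))             # ~0 up to floating point
```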

2.2 Gaussian smoothing

In [16], the authors investigated Gaussian smoothing of the output of a classical neural network for use as a PDE trial solution, which provides a simple expression (as an expectation) for all derivatives. Consider indeed a trial solution of the form

$$\begin{aligned} f(x):= {\mathbb {E}}_{\delta \sim {\mathcal {N}}(0,\textrm{I}\sigma ^{2})}[u(x+ \delta )], \qquad \text {for all }x\in {\mathbb {R}}^d, \end{aligned}$$
(2.2)

for some \(\sigma >0\), where u is the output of a neural network. Assuming u measurable, all derivatives of f can then be written as follows:

Theorem 2.2.1

([16, Theorem 1]) For any measurable function \(u: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\), the function f defined in (2.2) is differentiable and the following holds for all \(x\in {\mathbb {R}}^d\):

$$\begin{aligned} \nabla f(x) = \frac{1}{\sigma ^2} {\mathbb {E}}_{\delta \sim {\mathcal {N}}(0,\textrm{I}\sigma ^{2})} \left[ \delta u(x+ \delta )\right] . \end{aligned}$$

Theorem 2.2.1 implies that an (unbiased) estimate of the gradient can be computed by Monte Carlo, for example as

$$\begin{aligned} \widetilde{\nabla f}(x):= \frac{1}{K} \sum _{k=1}^{K} \frac{\delta _k}{\sigma ^2}u(x+ \delta _k), \end{aligned}$$

using K iid Gaussian \({\mathcal {N}}(0,\textrm{I}\sigma ^{2})\) samples \((\delta _k)_{k=1,\dots ,K}\). This can be improved using a combination of antithetic variable and control variate techniques (see for example [9, Chapter 4] for a thorough overview) resulting in the new estimator

$$\begin{aligned} \widetilde{\nabla f}(x) := \frac{1}{K} \sum _{k=1}^{K} \frac{\delta _k}{2 \sigma ^2}\Big (u(x+ \delta _k) - u(x- \delta _k)\Big ). \end{aligned}$$

This method can easily be extended to derivatives of any order by recursion, for example for the Laplacian:

$$\begin{aligned} \widetilde{\Delta f}(x):= \frac{1}{K} \sum _{k=1}^{K} \frac{\Vert \delta _k\Vert ^2 - d\sigma ^2 }{2\sigma ^4} \Big (u(x + \delta _k) - 2u(x) + u(x-\delta _k)\Big ). \end{aligned}$$
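A minimal numpy sketch of these variance-reduced Monte Carlo estimators, using a quadratic test function in place of the network output u (an illustrative assumption); smoothing a quadratic only shifts it by a constant, so the smoothed gradient and Laplacian coincide with the exact ones (2x and 2d), which makes the check straightforward.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, K = 4, 0.1, 200_000
x = rng.standard_normal(d)

# Quadratic test function standing in for the network output u.
def u(y):
    return np.sum(y * y, axis=-1)

delta = sigma * rng.standard_normal((K, d))          # K iid N(0, sigma^2 I) samples
up, um, u0 = u(x + delta), u(x - delta), u(x)

# Antithetic / control-variate estimators of the gradient and the Laplacian
grad_est = np.mean(delta * ((up - um) / (2 * sigma**2))[:, None], axis=0)
lap_est = np.mean((np.sum(delta**2, axis=1) - d * sigma**2) / (2 * sigma**4)
                  * (up + um - 2 * u0))

print(grad_est, 2 * x)    # gradient estimate vs exact value 2x
print(lap_est, 2 * d)     # Laplacian estimate vs exact value 2d
```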

In fact, the function f defined in (2.2) is Lipschitz continuous:

Theorem 2.2.2

Let \(u:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) be measurable. The map \({\mathbb {E}}_{\delta \sim {\mathcal {N}}(0,\textrm{I}\sigma ^{2})}[u(\cdot + \delta )]\) is Lipschitz with respect to any arbitrary norm. In particular with the 2-norm, it is \(\frac{{\mathfrak {u}}}{\sigma }\sqrt{\frac{2}{\pi }}\) Lipschitz, with \({\mathfrak {u}}= \sup _{x \in {\mathbb {R}}^d} |u(x)|\).

Proof

This statement is proved in [16] for the \(\ell _2\)-norm, and it is easy to extend to any arbitrary norm. Since f is differentiable by Theorem 2.2.1, it is well known that its Lipschitz constant \(L^{\alpha }(f)\) in the \(\alpha \)-norm can be obtained as

$$\begin{aligned} L^{\alpha }(f) = \sup _{x \in {\mathbb {R}}^d} \Vert \nabla _x f(x)\Vert _{\alpha }^{*}, \end{aligned}$$

with \(\Vert z\Vert _{\alpha }^{*}:= \sup _{\Vert y\Vert _{\alpha }\le 1}y^\top z\) the dual norm of \(\Vert \cdot \Vert _\alpha \). Using Theorem 2.2.1, we can write

$$\begin{aligned} L^{\alpha }(f) = \sup _{x \in {\mathbb {R}}^d} \left\Vert {\mathbb {E}}\left[ \frac{\delta }{\sigma ^2} u(x+ \delta ) \right] \right\Vert _{\alpha }^{*}&= \frac{1}{\sigma ^2}\sup _{x \in {\mathbb {R}}^d} \sup _{\Vert y\Vert _{\alpha }\le 1}\left| y^\top {\mathbb {E}}\left[ \delta u(x+ \delta ) \right] \right| \\&= \frac{1}{\sigma ^2}\sup _{x \in {\mathbb {R}}^d} \sup _{\Vert y\Vert _{\alpha }\le 1} \left| {\mathbb {E}}\left[ y^\top \delta u(x+ \delta ) \right] \right| \\&\le \frac{{\mathfrak {u}}}{\sigma ^2} \sup _{\Vert y\Vert _{\alpha }\le 1} {\mathbb {E}}\left[ \left| y^\top \delta \right| \right] \le \frac{{\mathfrak {u}}}{\sigma ^2} \sup _{\Vert y\Vert _{\alpha }\le 1}\sigma \Vert y\Vert _2\sqrt{\frac{2}{\pi }}. \end{aligned}$$

When \(\alpha = 2\), the upper bound becomes \(\frac{{\mathfrak {u}}}{\sigma } \sqrt{\frac{2}{\pi }}\) and the theorem follows. \(\square \)

As concluded in [16], this Lipschitz constant restriction is not particularly limiting in the classical setting since small values of \(\sigma \) can be used. However, in the quantum setting small values of \(\sigma \) introduce a large error if not enough shots are used. Specifically, consider the parameterised expectation (detailed in (2.3))

$$\begin{aligned} g(x):= \langle {\mathcal {C}}\rangle _{\texttt{M}(x)|{\varvec{0}}\rangle } = \langle {\varvec{0}}| \texttt{M}(x)^{\dagger }{\mathcal {C}}\texttt{M}(x)|{\varvec{0}}\rangle \in {\mathbb {R}}, \qquad \text {for }x\in {\mathbb {R}}^d. \end{aligned}$$

On actual quantum hardware we can only obtain an estimate \({\widetilde{g}}(x)\) of g(x) using a finite number of shots, and we denote

$$\begin{aligned} \varepsilon (x):= g(x) - {\widetilde{g}}(x) \end{aligned}$$

the (pointwise) inaccuracy, which is a random variable since \({\widetilde{g}}(x)\) is an empirical estimator. This framework allows for errors from both noisy circuits and estimating expectations using a finite number of shots. Since the map g is bounded and the numbers of shots and qubits are finite, there exists a constant \(\varepsilon _f > 0\) such that \(| \varepsilon (x)| \le \varepsilon _f\) uniformly over \(x \in {\mathbb {R}}^d\). Define \(G(x):= {\mathbb {E}}[g(x+ \delta _{\sigma })]\) and \({\widetilde{G}}(x):= {\mathbb {E}}\left[ {\widetilde{g}}(x+ \delta _{\sigma })\right] \), where \(\delta _{\sigma } \sim {\mathcal {N}}(0,\textrm{I}\sigma ^2)\). The following lemma provides a bound for the distance between the gradients of these two functions:

Lemma 2.2.3

The following bound holds for the Euclidean norm:

$$\begin{aligned} \sup _{x\in {\mathbb {R}}^d}\left\| \nabla G(x) - \nabla {\widetilde{G}}(x) \right\| \le \frac{\sqrt{d}\, \varepsilon _f}{ \sigma }. \end{aligned}$$

Proof

From Theorem 2.2.1, we can write, for any \(x\in {\mathbb {R}}^d\),

$$\begin{aligned} \left\| \nabla G(x) - \nabla {\widetilde{G}}(x) \right\|&= \left\| \frac{1}{\sigma ^2}{\mathbb {E}}[\delta _{\sigma } g(x+\delta _{\sigma })] - \frac{1}{\sigma ^2}{\mathbb {E}}\left[ \delta _{\sigma }{\widetilde{g}}(x+\delta _{\sigma })\right] \right\| \\&= \frac{1}{\sigma }\Big \Vert {\mathbb {E}}[\delta _{1} g(x+\delta _{1}\sigma )] - {\mathbb {E}}\left[ \delta _{1}{\widetilde{g}}(x+ \delta _{1}\sigma )\right] \Big \Vert \\&= \frac{1}{\sigma }\left\| {\mathbb {E}}[\delta _{1}\varepsilon (x+\delta _{1}\sigma )] \right\| \\&\le \frac{1}{\sigma } {\mathbb {E}}\Vert \delta _{1}\varepsilon (x+ \delta _{1}\sigma )\Vert \le \frac{\varepsilon _f}{\sigma }{\mathbb {E}}\Vert \delta _{1}\Vert =\frac{\varepsilon _f \sqrt{2}}{\sigma } \frac{\Gamma (\frac{d+1}{2})}{\Gamma (\frac{d}{2})} \le \frac{\sqrt{d}\,\varepsilon _f}{\sigma }. \end{aligned}$$

We use Jensen’s inequality to take the norm inside the expectation. The penultimate inequality follows since the Euclidean norm of a standard Gaussian vector follows a Chi distribution, and the last inequality is Gautschi’s [31, Equation (6)]. \(\square \)

Therefore, decreasing the parameter \(\sigma \) increases the impact of quantum-induced sampling error. Similar reasoning can be applied to finite-difference-based methods, and we refer the reader to [2, Section 2.1] for a general review of finite differences for gradients with error.

2.3 Random classical networks

We call ‘random classical network’ a single-hidden-layer feedforward neural network with randomly generated internal weights, where only the weights of the last layer are optimised over. Such random networks have previously been successfully applied to solving different types of (high-dimensional) PDEs [10, 17, 24, 30]. For the Black-Scholes-type PDE (similar to the heat equation), Gonon [10] provided a full error analysis of the prediction error of random neural networks. Specifically, let \(N \in {\mathbb {N}}\) be the number of hidden nodes, \({\varvec{B}}= (B_1,\cdots ,B_N)\) an iid sequence of real-valued random variables and \({\varvec{E}}=(E_1,\cdots ,E_N)\) another iid sequence of random vectors in \({\mathbb {R}}^{d}\), independent of \({\varvec{B}}\). For a vector \({\varvec{W}}\in {\mathbb {R}}^N\), define the random function

$$\begin{aligned} H_{{\varvec{W}}}^{{\varvec{E}}, {\varvec{B}}}(x):=\sum _{i=1}^N W_i \varrho \left( E_i^\top x + B_i\right) , \qquad \text {for all }x \in {\mathbb {R}}^d, \end{aligned}$$

where we consider the nonlinear activation function \(\varrho (x) = \max (0,x)\). The vector \({\varvec{W}}\) then represents the trainable parameters, while \({\varvec{E}}\) and \({\varvec{B}}\) are sampled from some prior distribution and then frozen. These random networks are considerably easier to train than traditional fully connected deep neural networks, especially in a supervised learning context, where training reduces to a linear regression. The PINN algorithm is not a supervised learning approach, so this reduction does not apply directly; the random network does, however, reduce the number of trainable parameters, and hence the computational burden. In [12], Gonon, Grigoryeva and Ortega proved that, as long as the unknown function is sufficiently regular, it is possible to draw the internal weights of the random network from a generic distribution (not depending on the unknown object) and quantify the approximation error in terms of the network architecture.
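The supervised case reduces to a linear regression, as the following sketch illustrates: the internal weights E and biases B are drawn once and frozen, and only the output layer W is fitted by least squares. The target function and the sampling distributions here are illustrative choices, not those analysed in [10].

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, n_samples = 2, 200, 1_000

# Frozen internal parameters: E_i in R^d and B_i in R, drawn once and never trained.
E = rng.standard_normal((N, d))
B = rng.standard_normal(N)

def features(X):
    """Hidden layer of the random network: rho(E_i^T x + B_i) with rho = ReLU."""
    return np.maximum(X @ E.T + B, 0.0)

# Supervised toy target; fitting W is an ordinary least-squares problem.
target = lambda X: np.sin(X[:, 0]) * np.cos(X[:, 1])
X_train = rng.uniform(-1, 1, (n_samples, d))
W, *_ = np.linalg.lstsq(features(X_train), target(X_train), rcond=None)

H = lambda X: features(X) @ W        # the trained random network H_W^{E,B}
X_test = rng.uniform(-1, 1, (500, d))
print(np.max(np.abs(H(X_test) - target(X_test))))   # max error on held-out points
```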

2.4 Quantum neural networks

2.4.1 Quantum neural networks

Using a quantum network for the PDE trial solution in the PINN algorithm was first proposed by Kyriienko, Paine and Elfving [19]. Essentially, the spatial variable x is embedded into a quantum state via a unitary operator \(\texttt{U}(x)\) (referred to as the feature map), then a parameterised unitary operator \(\texttt{U}_\Theta \) (independent of x) is applied before producing the output of the network by taking the expectation of a Hermitian cost operator \({\mathcal {C}}\):

$$\begin{aligned} u_\Theta (x):= \langle {\mathcal {C}}\rangle _{\texttt{U}_\Theta \, \texttt{U}(x) |{\varvec{0}}\rangle } = \langle {{\varvec{0}}|\left( \texttt{U}_\Theta \, \texttt{U}(x) \right) ^{\dagger }}{\mathcal {C}}\, \texttt{U}_\Theta \, \texttt{U}(x) |{\varvec{0}}\rangle , \end{aligned}$$

and the parameters \(\Theta \) are found by minimising some loss function, for example (2.1). In [19], the authors suggested that increasing the number of qubits or using cost operators with more complex Pauli decompositions could produce more expressive networks. Preliminary research [26,27,28] has shown the potential of parameterising the feature maps and of repeatedly applying unitary operators, leading to the more general formulation

$$\begin{aligned} u_\Theta (x) = \langle {{\varvec{0}}|\left( \prod _{i=L} ^{1} \texttt{U}_{i,\theta _i} \texttt{U}_{i,\omega _i}(x) \right) ^{\dagger }}{\mathcal {C}}\left( \prod _{i=L}^{1} \texttt{U}_{i,\theta _i} \texttt{U}_{i,\omega _i}(x) \right) |{\varvec{0}}\rangle , \end{aligned}$$

for some positive integer L, where \(\Theta = \{ \theta _1,\omega _1,\cdots ,\theta _L,\omega _L\}\) is the set of all hyperparameters. In particular, the authors in [28] showed that one-qubit networks can act as universal approximators for bounded complex continuous functions or integrable functions with a finite number of finite discontinuities.

2.4.2 Random quantum networks

Consider a system with n qubits. Let \(\varvec{A}: {\Omega } \rightarrow {\mathbb {C}}^{2^n \times 2^n}\) be a random function which maps the spatio-temporal variable x to a unitary matrix and \(\varvec{\Lambda } \in {\mathbb {C}}^{2^n \times 2^n}\) a random unitary matrix distributed according to the Haar measure. Then for a suitable set of parameters \(\Theta \), we define the random function \(u^{\varvec{\Lambda },\varvec{A}}_\Theta : \Omega \rightarrow {\mathbb {R}}\),

$$\begin{aligned} u^{\varvec{\Lambda },\varvec{A}}_\Theta (x):= \langle {{\varvec{0}}| \Bigl ( \texttt{U}_{\Theta }(x)\varvec{\Lambda }\varvec{A}(x) \Bigr ) ^{\dagger }}{\mathcal {C}}\Bigl ( \texttt{U}_{\Theta }(x)\varvec{\Lambda }\varvec{A}(x)\Bigr )|{\varvec{0}}\rangle , \end{aligned}$$

where \({\mathcal {C}}\) is a Hermitian cost operator. When using this random quantum neural network to approximate u we generate \(\varvec{A}\) and \(\varvec{\Lambda }\), consider them fixed and train the parameters \(\Theta \). Specific examples of \(\varvec{A}\) are given in Sect. 3.1.

Remark 2.4.1

In practice, one may encode the data x only through \(\varvec{A}\) and leave \(\texttt{U}_{\Theta }\) independent of x. We leave the current formulation as is, allowing for more generality.

2.4.3 Optimised parameter shift

When differentiating parameterised expectation values we apply the family of parameter shift rules discussed by Mari, Bromley and Killoran [23]. In quantum computing, one-qubit rotation gates can be written as \(\exp \left\{ -\frac{\textrm{i}x}{2} \texttt{G}\right\} \) for some Hermitian unitary matrix \(\texttt{G}\) and some real number x. We shall require here a slightly modified version:

Definition 2.4.2

For \(x\in {\mathbb {R}}\), the matrix \(\texttt{M}(x)\) is a single-qubit rotation-like gate if

$$\begin{aligned} \texttt{M}(x) = \exp \left\{ -\frac{\textrm{i}x}{2} \texttt{G}\right\} , \end{aligned}$$

for some complex-valued involutory generator matrix \(\texttt{G}\) (satisfying \(\texttt{G}^2 = \textrm{I}\)).

The matrix \(\texttt{M}(x)\), for x real, represents a rotation of angle x around the axis spanned by \(\texttt{G}\). An arbitrary single-qubit gate can be written using a generator matrix that is merely Hermitian; here we are therefore restricting ourselves to a smaller set of quantum gates.

We shall use these single-qubit rotation-like gates to construct an approximation \(u^{\varvec{\Lambda },\varvec{A}}_\Theta \). The following lemma allows us to decompose the unitary conjugation \(\texttt{M}(x)^{\dagger }\, {\mathcal {C}}\, \texttt{M}(x)\) into the sum of three easy-to-compute terms, a result mentioned without proof in [23, Equation (5)]:

Lemma 2.4.3

For any \(x\in {\mathbb {R}}\), the identity \(\texttt{M}(x)^{\dagger }{\mathcal {C}}\texttt{M}(x) = \texttt{A}+ \texttt{B}\cos (x) + \texttt{C}\sin (x)\) holds for any single-qubit rotation-like gate \(\texttt{M}(x)\) with

$$\begin{aligned} \texttt{A}= \frac{\texttt{G}^\dagger {\mathcal {C}}\texttt{G}+ {\mathcal {C}}}{2},\qquad \texttt{B}= \frac{{\mathcal {C}}- \texttt{G}^\dagger {\mathcal {C}}\texttt{G}}{2},\qquad \texttt{C}= \frac{\textrm{i}}{2}\left( \texttt{G}^\dagger {\mathcal {C}}- {\mathcal {C}}\texttt{G}\right) . \end{aligned}$$

Proof

Let \(x\in {\mathbb {R}}\). Since \(\texttt{M}(x)\) is a rotation-like gate as in Definition 2.4.2 with involutory generator \(\texttt{G}\), it follows directly that

$$\begin{aligned} \texttt{M}(x) = \exp \left\{ -\frac{\textrm{i}x}{2} \texttt{G}\right\} = \cos \left( \frac{x}{2}\right) \textrm{I}- \textrm{i}\sin \left( \frac{x}{2}\right) \texttt{G}. \end{aligned}$$

Therefore

$$\begin{aligned} \texttt{M}^{\dagger }{\mathcal {C}}\texttt{M}&= \Big (\cos \left( \frac{x}{2}\right) \textrm{I}- \textrm{i}\sin \left( \frac{x}{2}\right) \texttt{G}\Big )^\dagger {\mathcal {C}}\Big (\cos \left( \frac{x}{2}\right) \textrm{I}- \textrm{i}\sin \left( \frac{x}{2}\right) \texttt{G}\Big )\\&= \Big (\cos \left( \frac{x}{2}\right) \textrm{I}+ \textrm{i}\sin \left( \frac{x}{2}\right) \texttt{G}^\dagger \Big ){\mathcal {C}}\Big (\cos \left( \frac{x}{2}\right) \textrm{I}- \textrm{i}\sin \left( \frac{x}{2}\right) \texttt{G}\Big )\\&= \left[ \cos \left( \frac{x}{2}\right) ^2{\mathcal {C}}+ \sin \left( \frac{x}{2}\right) ^2\texttt{G}^\dagger {\mathcal {C}}\texttt{G}\right] + \textrm{i}\left[ \sin \left( \frac{x}{2}\right) \cos \left( \frac{x}{2}\right) \texttt{G}^\dagger {\mathcal {C}}- \sin \left( \frac{x}{2}\right) \cos \left( \frac{x}{2}\right) {\mathcal {C}}\texttt{G}\right] \\&= \left[ \cos \left( \frac{x}{2}\right) ^2{\mathcal {C}}+ \left( 1-\cos \left( \frac{x}{2}\right) ^2\right) \texttt{G}^\dagger {\mathcal {C}}\texttt{G}\right] + \frac{\textrm{i}}{2}\left( \texttt{G}^\dagger {\mathcal {C}}- {\mathcal {C}}\texttt{G}\right) \sin (x)\\&= \texttt{G}^\dagger {\mathcal {C}}\texttt{G}+ \left( {\mathcal {C}}- \texttt{G}^\dagger {\mathcal {C}}\texttt{G}\right) \cos \left( \frac{x}{2}\right) ^2 + \frac{\textrm{i}}{2}\left( \texttt{G}^\dagger {\mathcal {C}}- {\mathcal {C}}\texttt{G}\right) \sin (x)\\&= \texttt{G}^\dagger {\mathcal {C}}\texttt{G}+ \left( {\mathcal {C}}- \texttt{G}^\dagger {\mathcal {C}}\texttt{G}\right) \frac{1+\cos (x)}{2} + \frac{\textrm{i}}{2}\left( \texttt{G}^\dagger {\mathcal {C}}- {\mathcal {C}}\texttt{G}\right) \sin (x)\\&= \frac{\texttt{G}^\dagger {\mathcal {C}}\texttt{G}+ {\mathcal {C}}}{2} +\frac{{\mathcal {C}}- \texttt{G}^\dagger {\mathcal {C}}\texttt{G}}{2}\cos (x) + \frac{\textrm{i}}{2}\left( \texttt{G}^\dagger {\mathcal {C}}- {\mathcal {C}}\texttt{G}\right) \sin (x), \end{aligned}$$

and the lemma follows. \(\square \)
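Lemma 2.4.3 only relies on the involution property \(\texttt{G}^2 = \textrm{I}\), and it is easy to check numerically. The sketch below uses a random Hermitian cost operator and, as an illustrative assumption, a Householder reflection as the involutory generator; it then compares both sides of the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4                                             # e.g. a two-qubit block

# Random Hermitian cost operator (stands in for the cost C of the lemma)
R = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
cost = (R + R.conj().T) / 2

# Random involutory generator G (Householder reflection: G = G^dagger, G^2 = I)
v = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
v /= np.linalg.norm(v)
G = np.eye(dim) - 2 * np.outer(v, v.conj())

A = (G.conj().T @ cost @ G + cost) / 2
B = (cost - G.conj().T @ cost @ G) / 2
Cmat = 1j / 2 * (G.conj().T @ cost - cost @ G)

x = 0.7
M = np.cos(x / 2) * np.eye(dim) - 1j * np.sin(x / 2) * G   # M(x) = exp(-i x G / 2)
lhs = M.conj().T @ cost @ M
rhs = A + B * np.cos(x) + Cmat * np.sin(x)
print(np.allclose(lhs, rhs))                               # True
```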

Consider the function \(g:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) defined as

$$\begin{aligned} g(x):= \langle {{\varvec{0}}| \texttt{M}(x)^{\dagger }\, {\mathcal {C}}\, \texttt{M}(x)|{\varvec{0}}}\rangle , \qquad \text {for any }x\in {\mathbb {R}}^d, \end{aligned}$$
(2.3)

given a circuit \(\texttt{M}(x)\) composed of single-qubit rotation-like gates as in Definition 2.4.2, where each component of x acts as a rotation angle. Clearly g is infinitely differentiable, and it is furthermore periodic in each component by Lemma 2.4.3. Denote its partial derivatives

$$\begin{aligned} g_{j_1,j_2,\cdots ,j_d} (x):= \frac{\partial ^d g(x)}{\partial x_{j_1} \partial x_{j_2}\cdots \partial x_{j_d}}, \qquad \text {for any }x\in {\mathbb {R}}^d. \end{aligned}$$

Applying the family of parameter shift rules [23, Equation (9)] results in

$$\begin{aligned} g_{j} (x) = \frac{g(x+s{\varvec{e}}_j)-g(x-s{\varvec{e}}_j)}{2\sin (s)}, \qquad \text {for any }x\in {\mathbb {R}}^d, \end{aligned}$$
(2.4)

for any \(s \in {\mathbb {R}}\setminus \{k\pi :k\in {\mathbb {Z}}\}\), where \({\varvec{e}}_j\) is a unit vector along the \(x_j\) axis. Iteratively applying this rule with the same shift results in

$$\begin{aligned} g_{j_1,j_2}(x) = \frac{g(x+s({\varvec{e}}_{j_1}+{\varvec{e}}_{j_2}))- g(x+s(-{\varvec{e}}_{j_1}+{\varvec{e}}_{j_2})) -g(x+s({\varvec{e}}_{j_1} - {\varvec{e}}_{j_2}))+g(x-s({\varvec{e}}_{j_1}+{\varvec{e}}_{j_2}))}{4\sin (s)^2}, \end{aligned}$$

for any \(x\in {\mathbb {R}}^d\). For \(j_1 = j_2\), using the value \(s=\frac{\pi }{2}\) reduces this to

$$\begin{aligned} g_{j,j} (x) = \frac{g(x+\pi {\varvec{e}}_j)-g(x)}{2}, \end{aligned}$$
(2.5)

and for \(s = \frac{\pi }{4}\) we obtain

$$\begin{aligned} g_{j,j} (x) = \frac{g\left( x+{\varvec{e}}_j\frac{\pi }{2}\right) -2g(x) + g\left( x-{\varvec{e}}_j\frac{\pi }{2}\right) }{2}. \end{aligned}$$
(2.6)

While (2.5) only requires the evaluation of two expectation values compared to the three of (2.6), the latter has the distinct advantage of providing both derivatives \(g_{j,j}\) and \(g_{j}\). For the complexity analysis we apply either (2.5) or (2.6), depending on which one is more efficient for the chosen PDE. The optimised parameter shift rules above are covered in [23], but simple yet tedious computations can provide higher-order derivatives if needed, as shown in the following:

Theorem 2.4.4

Let g be any function of the form (2.3) with \(\texttt{M}\) built from single-qubit rotation-like gates and \({\mathcal {C}}\) a Hermitian cost operator. For any \(d \ge 2\) and any x,

$$\begin{aligned} \frac{\partial ^d g(x)}{\partial x_j ^d}= & \left[ 2\sin \left( \frac{\pi }{d}\right) \right] ^{-d}\\ & \left\{ \left( 1 + (-1)^d\right) g(x + \pi {\varvec{e}}_j) + \sum _{i=1}^{d-1} (-1)^{i+d}{d \atopwithdelims (){i}}g\left( x + \left( \frac{2\pi i}{d}-\pi \right) {\varvec{e}}_j\right) \right\} . \end{aligned}$$

Proof

Repeated applications of the parameter shift rule (2.4) with the same basis vector and the same shift magnitude for each shift results in

$$\begin{aligned} \frac{\partial ^d}{\partial x_j ^d} g(x) = \sum _{i=0}^{d} (-1)^{i+d}{d \atopwithdelims (){i}}\frac{g\left( x + (\frac{2\pi i}{d}-\pi ){\varvec{e}}_j\right) }{(2\sin (\pi /d))^d}. \end{aligned}$$

The first and last terms in the expansion are

$$\begin{aligned} \frac{1}{(2\sin (\pi /d))^d}\left( (-1)^d g(x-\pi {\varvec{e}}_j) + g(x+\pi {\varvec{e}}_j)\right) . \end{aligned}$$

By periodicity of our expectation function, for d even this reduces to \(\frac{2g(x+\pi {\varvec{e}}_j)}{(2\sin (\pi /d))^d}\), whereas when d is odd the first and last terms in the expansion cancel. \(\square \)

We thus have the ability to find a \(d^{\text {th}}\)-order unmixed partial derivative with at most d evaluations of the parameterised expectation. Quantum automatic differentiation engines can be built using a combination of the chain rule and the above formulae, where the argument of each expectation is calculated on a classical computer and the expectation is calculated on a quantum computer, before the result is fed back to the classical computer, which combines the different values to construct the derivative. This is in sharp contrast to the automatic differentiation procedures carried out on classical neural networks, where different orders of derivatives can only be calculated sequentially: for traditional automatic differentiation the algorithm has to first build the computational graph for the first-order derivatives and then perform back-propagation on that graph for the second-order derivatives, before this process is repeated. Note that previous publications such as [19] suggested repeated applications of the basic parameter shift rule, which does not benefit from the computational acceleration discussed above.
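The shift rules (2.4)-(2.6) can be sanity-checked on the simplest possible example, a single qubit with generator \(\texttt{G}=\texttt{X}\) and cost \({\mathcal {C}}=\texttt{Z}\), for which \(g(x) = \cos (x)\) in closed form. The sketch below is only a check of the formulas under these assumptions, not a simulation of the full network.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)    # generator G
Z = np.array([[1, 0], [0, -1]], dtype=complex)   # cost operator C
ket0 = np.array([1, 0], dtype=complex)

def g(x):
    """g(x) = <0| M(x)^dagger C M(x) |0> with M(x) = exp(-i x X / 2); equals cos(x)."""
    M = np.cos(x / 2) * np.eye(2) - 1j * np.sin(x / 2) * X
    psi = M @ ket0
    return np.real(psi.conj() @ Z @ psi)

x, s = 0.4, 0.9
d1_shift = (g(x + s) - g(x - s)) / (2 * np.sin(s))           # rule (2.4)
d2_pi2   = (g(x + np.pi) - g(x)) / 2                          # rule (2.5), s = pi/2
d2_pi4   = (g(x + np.pi/2) - 2*g(x) + g(x - np.pi/2)) / 2     # rule (2.6), s = pi/4

print(d1_shift, -np.sin(x))          # first derivative of cos
print(d2_pi2, d2_pi4, -np.cos(x))    # second derivative of cos
```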

3 Random network architecture and complexity analysis

3.1 Random network architecture

We assume the input data x is scaled so that each component lies in the interval [0, 1]. By repeated applications of the so-called UAT (Universal Approximation) gate, one-qubit circuits have the ability to approximate any bounded complex function on the unit hypercube [28]. This UAT gate is defined as

$$\begin{aligned} \texttt{U}^{\text {UAT}}_\Theta (x):= \texttt{R}_{y}(2\varphi )\texttt{R}_{z}(2\gamma ^\top x +2\alpha ), \qquad \text {for all } x \in [0,1]^d, \end{aligned}$$

with \(\Theta = (\varphi ,\gamma ,\alpha ) \in {\mathbb {R}}\times {\mathbb {R}}^d\times {\mathbb {R}}\). This was extended in [26] to multiple qubits by applying the UAT gate to each qubit followed by an entangling layer. We create the random network’s trainable layer using this UAT gate as well as an entangling layer, which can be repeated several times. For the entangling layer we choose the sequence

$$\begin{aligned} \prod _{i= n -1}^{0}\texttt{C}_{\texttt{NOT}}(q_i,q_{i+1 \, \textrm{mod} \, n}), \end{aligned}$$

of controlled NOT gates, where \(\texttt{C}_{\texttt{NOT}}(q_i,q_j)\) denotes the controlled NOT gate acting on qubit \(q_j\) with control qubit \(q_i\). The whole trainable M-layer circuit then reads

$$\begin{aligned} \texttt{U}_{\Theta }(x):= \prod _{j= 1}^{M}\left( \prod _{i= n -1}^{0}\texttt{C}_{\texttt{NOT}}(q_i,q_{i+1 \, \textrm{mod} \, n})\right) \left( \bigotimes _{i= 0}^{n -1} \texttt{U}^{\text {UAT}}_{\Theta _{i,j}}(x)\right) , \end{aligned}$$
(3.1)

and we write \(\Theta = \{\Theta _{i,j}\}_{0 \le i \le n-1, 1 \le j \le M }\). For the ‘encoding’ matrix \(\varvec{A}(x)\) introduced in Sect. 2.4.2, we use

$$\begin{aligned} \varvec{A}(x):= \bigotimes _{j = 1 }^{n} \texttt{R}_{z}\left[ {\mathcal {X}}_j\pi {{\,\mathrm{\textrm{acos}}\,}}\left( \frac{3}{2} \left( x_{{\mathfrak {g}}_{j}}-\frac{1}{2}\right) \right) \right] , \end{aligned}$$
(3.2)

where \(({\mathcal {X}}_j)_{j=1,\ldots ,n}\) are iid \({\mathcal {U}}(0,1)\) random variables and the index \({\mathfrak {g}}_{j}:= 1 + (j+1 \mod d)\), with nonlinear activation function \({{\,\mathrm{\textrm{acos}}\,}}(\cdot )\). The constants inside \({{\,\mathrm{\textrm{acos}}\,}}(\cdot )\) and the constant \(\pi \) inside \(\texttt{R}_{z}\) are justified experimentally, as arguments of \(\texttt{R}_{z}\) away from \(\pm \pi \) lead to better results. Such a choice of encoding matrix is inspired by [19], where it is shown that the resulting matrix \(\varvec{A}(x)\) can be written as a Chebyshev polynomial, the order of which increases with n, known to approximate nonlinear functions well. Alternative choices for the activation function can be considered with minor modifications, as is standard in the classical NN literature [6]. However, based on the theoretical reasoning provided above and corroborated by numerical experiments, we found that \({{\,\mathrm{\textrm{acos}}\,}}\) typically produced better results than other approaches.
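For small n, the encoding layer (3.2) can be assembled explicitly as a Kronecker product of \(\texttt{R}_{z}\) rotations with the acos feature map, as in the sketch below; the 0-based index convention used for \({\mathfrak {g}}_{j}\) is an adaptation for Python arrays and should be read as illustrative.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)

def Rz(theta):
    return np.array([[np.exp(-1j * theta / 2), 0],
                     [0, np.exp(1j * theta / 2)]])

def encoding(x, chi):
    """A(x) from (3.2): tensor product of R_z(chi_j * pi * acos(1.5 (x_{g_j} - 0.5)))."""
    d = len(x)
    angles = [chi_j * np.pi * np.arccos(1.5 * (x[(j + 1) % d] - 0.5))
              for j, chi_j in enumerate(chi)]
    return reduce(np.kron, [Rz(a) for a in angles])

d, n = 2, 4
chi = rng.uniform(0, 1, n)           # the iid U(0,1) draws, frozen afterwards
x = np.array([0.3, 0.8])
A = encoding(x, chi)
print(A.shape, np.allclose(A @ A.conj().T, np.eye(2**n)))   # unitary 2^n x 2^n matrix
```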

Finally, we use a specific Ising Hamiltonian with transverse and longitudinal magnetic fields and homogeneous Ising couplings for the cost operator:

$$\begin{aligned} {\mathcal {C}}= \sum _{j=1}^{n} \left[ \widehat{\texttt{Z}}_{j (\textrm{mod}\, n) }\widehat{\texttt{Z}}_{j+1 (\textrm{mod}\, n)} + \widehat{\texttt{Z}}_{j (\textrm{mod}\, n)} + \widehat{\texttt{X}}_{j(\textrm{mod}\, n)} \right] , \end{aligned}$$
(3.3)

where \(\texttt{X}\) and \(\texttt{Z}\) are the usual one-qubit Pauli gates and the index refers to the qubit they act upon. This is a relatively complex cost operator, allowing us to approximate a wider class of functions than the simpler cost operator \(\sum _j \widehat{\texttt{Z}}_j\). Therefore the output of the quantum network is

$$\begin{aligned} u_\Theta (x):= \langle {\mathcal {C}}\rangle _{\texttt{U}_{\Theta }(x) |{\varvec{0}}\rangle } = \langle {{\varvec{0}}|\left( \texttt{U}_{\Theta }(x) \right) ^{\dagger }}{\mathcal {C}}\, \texttt{U}_{\Theta }(x) |{\varvec{0}}\rangle . \end{aligned}$$
(3.4)

3.2 Complexity analysis

Define the auxiliary variables \(\alpha _i\) and \(\alpha _{i,j}\) as the numbers of single-qubit rotation-like gates in the circuit for \(u_\Theta \) with, respectively, \(x_i\) dependence and joint \(x_i\) and \(x_j\) dependence. These values are not uniquely defined, since the decomposition into single-qubit rotation-like gates is not unique. Given the loss function (2.1), define the quantity \(\xi ({\mathcal {L}}_{n_e,n_r})\) as the total number of \(u_\Theta \) evaluations needed to calculate the loss function \({\mathcal {L}}_{n_e,n_r}\); we will use it as our complexity metric. Let \(N_{\Theta }\) be the total number of components of \(\Theta \). It may seem pertinent to include in this metric the cost of computing all the derivatives \((\partial _{\Theta _i}{\mathcal {L}})_{i=1,\ldots ,N_{\Theta }}\) (for example for the optimisation part); this is however unnecessary, since the exact number of \(u_\Theta \) evaluations required to do so is given by \(2N_{\Theta }\,\xi ({\mathcal {L}})\).

Note that the application of an efficient automatic differentiation engine to a classical neural network will almost surely result in a more efficient scaling of \(\xi ({\mathcal {L}})\) with respect to the number of trainable parameters. For example, given a function mapping \({\mathbb {R}}^n\) to \({\mathbb {R}}\), the entire Jacobian matrix can be computed at a cost of at most c function evaluations, where c is guaranteed to be under 6 and is typically between 2 and 3 [14].

3.3 The p-Laplace equation

Consider the p-Laplace (\(1< p < \infty \)) equation with Dirichlet boundary conditions:

$$\begin{aligned} {\left\{ \begin{array}{ll} \Delta _p u + f = 0, & \text {on }\Omega , \\ u=h, & \text {on }\partial \Omega , \end{array}\right. } \end{aligned}$$
(3.5)

where \(u \in {\mathcal {W}}^{1, p}(\Omega )\), \(f \in L^{\frac{p}{p-1}}(\Omega )\) and h is the trace of some \({\mathcal {W}}^{1, p}(\Omega )\) function. The p-Laplace operator reads

$$\begin{aligned} \Delta _p u:= |\nabla u|^{p-4}\left( |\nabla u|^{2}\Delta u + (p-2)\sum _{i,j =1}^{d} \frac{\partial u}{\partial x_i}\frac{\partial u}{\partial x_j}\frac{\partial ^2 u}{\partial x_i \partial x_j}\right) . \end{aligned}$$
(3.6)

For the variational formulation we have [3, Lemma (2.3)]

$$\begin{aligned} u \in \mathop {\mathrm {\arg \!\min }}\limits _{ \begin{array}{c} v \in {\mathcal {W}}^{1,p}(\Omega ) \\ \textrm{T}v = h \end{array}} \left( \frac{1}{p}\int _\Omega |\nabla v|^p \textrm{d}x - \int _\Omega f v \textrm{d}x\right) . \end{aligned}$$
(3.7)

We refer the reader to [8] for a reference on the weak formulation of PDEs. To translate the variational statement (3.7) into a loss function, we add a penalisation term for the boundary conditions and approximate the integrals using the sample points, resulting in the empirical loss function

$$\begin{aligned} {\mathcal {L}}^{\text {Var}}_{n_e,n_r}(u_\Theta ):= & \frac{\lambda _e}{n_e}\sum _{i=1}^{n_e}\left| u_\Theta \left( X^{(e)}_i\right) - h\left( X^{(e)}_i\right) \right| \nonumber \\ & + \frac{1}{n_r}\sum _{i=1}^{n_r}\left( \frac{1}{p} |\nabla u_\Theta (X^{(r)}_i)|^p - f(X^{(r)}_i)\, u_\Theta (X^{(r)}_i) \right) . \end{aligned}$$
(3.8)

This idea of minimising a functional to solve PDEs using neural networks has previously been studied for Poisson equations [25, 33]. The cases \(p\ne 2\) and \(p=2\) need to be studied separately, since the mixed second-order partial derivatives in the p-Laplace operator (3.6) vanish when \(p=2\).

3.3.1 The \(p \ne 2\) case

Lemma 3.3.1

For the prototypical PINN loss function (2.1), we have

$$\begin{aligned} \xi ({\mathcal {L}}_{n_e,n_r}) = n_r \sum _{i,j = 1, i \le j}^{d} \left( 4\alpha _i \alpha _j - \alpha _{i,j} \right) + n_e, \end{aligned}$$

provided the second-order partial derivatives all commute, namely when

$$\begin{aligned} \left[ \frac{\partial u_{\Theta }}{\partial x_i},\frac{\partial u_{\Theta }}{\partial x_j} \right] = 0, \quad \text {for all } i,j \le d. \end{aligned}$$

Without any assumptions on the partial derivatives, we have

$$\begin{aligned} \xi ({\mathcal {L}}_{n_e,n_r}) = n_r \sum _{i,j = 1}^d \left( 4\alpha _i \alpha _j - \alpha _{i,j} \right) + n_e. \end{aligned}$$

Proof

Consider decomposing the quantum circuit (3.1) responsible for producing \(u_\Theta (x)\) into one with just single-qubit rotation-like gates and CNOTs. Assuming P single-qubit rotation-like gates in this decomposition and with a slight abuse of notation,

$$\begin{aligned} u_\Theta (x) = g(\phi _1(x),\cdots ,\phi _{P} (x)) = g(\phi (x)), \end{aligned}$$
(3.9)

where each \(\phi _i\) is the rotation angle of a particular single-qubit rotation-like gate. Note that this decomposition is clearly not unique; one could for instance choose the decomposition with the lowest \(\xi \) value. Basic applications of the chain rule then yield

$$\begin{aligned} \frac{\partial u_\Theta }{\partial x_i}&= \sum _a \frac{\partial g}{\partial \phi _a } \frac{\partial \phi _a}{\partial x_i},\nonumber \\ \frac{\partial ^2 u_\Theta }{\partial x_i \partial x_j}&= \sum _{a,b} \frac{\partial ^2 g}{\partial \phi _b \partial \phi _a}\frac{\partial \phi _a}{\partial x_i}\frac{\partial \phi _b}{\partial x_j} + \sum _a \frac{\partial g}{\partial \phi _a} \frac{\partial ^2 \phi _a}{\partial x_i \partial x_j}. \end{aligned}$$
(3.10a)

We then split the sum over (a, b), resulting in

$$\begin{aligned} \frac{\partial ^2 u_\Theta }{\partial x_i \partial x_j} = \sum _{a \ne b} \frac{\partial ^2 g}{\partial \phi _b \partial \phi _a}\frac{\partial \phi _a}{\partial x_i}\frac{\partial \phi _b}{\partial x_j} + \sum _a \frac{\partial ^2 g}{\partial \phi _a^2}\frac{\partial \phi _a}{\partial x_i}\frac{\partial \phi _a}{\partial x_j} + \sum _a \frac{\partial g}{\partial \phi _a} \frac{\partial ^2 \phi _a}{\partial x_i \partial x_j}. \end{aligned}$$

In the first summation there are \(\alpha _i \alpha _j - \alpha _{i,j}\) terms, so applying the standard parameter shift rule (2.4) twice results in \(4(\alpha _i \alpha _j - \alpha _{i,j})\) evaluations of \(u_\Theta \). In the second summation there are \(\alpha _{i,j}\) terms, so the optimised parameter shift rule (2.6) results in \(3\alpha _{i,j}\) evaluations of \(u_\Theta \). We use this parameter shift rule as it also provides all of the quantum gate partial derivatives needed for the third summation.

For the p-Laplace operator (3.6), we require all mixed partial derivatives, that is the derivatives for all pairs (i, j); as a result, the number of \(u_\Theta \) evaluations needed to compute all these derivatives at a single sample point is equal to

$$\begin{aligned} \sum _{i,j=1}^d \left( 4(\alpha _i \alpha _j - \alpha _{i,j}) + 3\alpha _{i,j} \right) = \sum _{i,j=1}^d \left( 4\alpha _i \alpha _j - \alpha _{i,j} \right) , \end{aligned}$$

where we have assumed that the derivatives do not necessarily commute. Assuming they do, we only need to sum over pairs with \(i \le j\), obtaining

$$\begin{aligned} \sum _{i,j = 1, i \le j}^d \left( 4\alpha _i \alpha _j - \alpha _{i,j} \right) . \end{aligned}$$

Note that the \(n_r\) factor comes from the number of times the residual must be evaluated; since the boundary data appears with no derivatives, each boundary sample only requires one \(u_\Theta \) evaluation. \(\square \)

Lemma 3.3.2

For the variational loss function formulation (3.8),

$$\begin{aligned} \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right) = n_r\left( 1 + 2\sum _{i=1}^{d} \alpha _i\right) + n_e. \end{aligned}$$
(3.11)

Proof

The minimisation statement arising from the variational formulation (3.7) involves both the gradient and the function value at each sample point. Using the same decomposition as before (3.9), the chain rule (3.10a) and the basic parameter shift rule (2.4), the gradient involves exactly \(2\sum _{i=1}^{d} \alpha _i\) evaluations of \(u_\Theta \), while the function value requires a single \(u_\Theta \) evaluation. This is done for each sample point inside the domain, resulting in

$$\begin{aligned} \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right) = n_r\left( 1 + 2\sum _{i=1}^{d} \alpha _i\right) . \end{aligned}$$

The extra \(n_e\) term in (3.11) is a result of the boundary penalisation term in (3.8). \(\square \)

Lemma 3.3.3

If the PDE trial solution is the Gaussian smoothed output of a quantum network then, using K classical samples of Gaussian noise,

$$\begin{aligned} \xi \left( {\mathcal {L}}_{n_e,n_r}^{\text {Smooth}}\right) = K(5n_r+n_e). \end{aligned}$$
(3.12)

Proof

When the output of the quantum network is Gaussian smoothed (2.2), the Hessian matrix \(\varvec{H} f_\Theta \) can be calculated using

$$\begin{aligned} \varvec{H} f_\Theta (x) ={\mathbb {E}}\left[ \left( \frac{\delta \delta ^{\top }-\sigma ^2 \textrm{I}}{\sigma ^4}\right) u_\Theta (x+\delta )\right] , \qquad \text {for all }x \in \Omega , \end{aligned}$$
(3.13)

with \(\delta \sim {\mathcal {N}}\left( 0, \sigma ^2 \textrm{I}\right) \), as proven in [16]. The variance of this estimator can be reduced by applying the control variate and antithetic variable methods. For the additive control variate we use \(u_\Theta (x)\) as a baseline, which turns the estimate (3.13) into

$$\begin{aligned} \varvec{H} f_\Theta (x) ={\mathbb {E}}\left[ \left( \frac{\delta \delta ^{\top }-\sigma ^2 \textrm{I}}{\sigma ^4}\right) \Big \{u_\Theta (x+\delta )-u_\Theta (x)\Big \}\right] , \qquad \text {for all }x \in \Omega .\qquad \end{aligned}$$
(3.14)

This is exactly what the authors of [16] do. Notice that the expectations in (3.13) and (3.14) are unchanged when \(\delta \) is substituted with \(-\delta \); averaging this new estimator with the one in (3.14) results in

$$\begin{aligned} \varvec{H} f_\Theta (x) ={\mathbb {E}}\left[ \left( \frac{\delta \delta ^{\top }-\sigma ^2 \textrm{I}}{2\sigma ^4}\right) \Big \{u_\Theta (x+\delta )+ u_\Theta (x-\delta ) - 2u_\Theta (x)\Big \}\right] . \end{aligned}$$

The same technique can be applied to the estimator for the gradient resulting in

$$\begin{aligned} \nabla _x f_\Theta (x) = {\mathbb {E}}\left[ \frac{\delta }{2 \sigma ^2}(u_\Theta (x+\delta )-u_\Theta (x-\delta ))\right] .\qquad \end{aligned}$$

There are a total of five \(u_\Theta \) terms across these two expressions, while the boundary term features a single \(f_\Theta \) evaluation for each sample point. Assuming we approximate each classical expectation with K iid Gaussian samples, the expression (3.12) follows. \(\square \)
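The variance-reduced Hessian estimator used in the proof can be tested on a quadratic function, for which the smoothed Hessian equals the exact one; the sketch below assumes, purely for illustration, \(u(x) = x^\top Q x\) with a fixed symmetric Q, so that the exact Hessian is 2Q.

```python
import numpy as np

rng = np.random.default_rng(4)
d, sigma, K = 3, 0.1, 200_000
Q = rng.standard_normal((d, d)); Q = (Q + Q.T) / 2      # fixed symmetric matrix
u = lambda y: np.einsum('...i,ij,...j->...', y, Q, y)   # quadratic test function
x = rng.standard_normal(d)

delta = sigma * rng.standard_normal((K, d))
up, um, u0 = u(x + delta), u(x - delta), u(x)

# Antithetic + control-variate form of the Hessian estimator (3.13)
weights = (up + um - 2 * u0) / (2 * sigma**4)
outer = np.einsum('ki,kj->kij', delta, delta) - sigma**2 * np.eye(d)
H_est = np.mean(weights[:, None, None] * outer, axis=0)

print(np.max(np.abs(H_est - 2 * Q)))   # exact Hessian is 2Q; error shrinks with K
```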

Lemma 3.3.4

We can combine a Gaussian smoothed trial solution with the variational loss function reformulation (3.8) and (3.7) resulting in

$$\begin{aligned} \xi \left( {\mathcal {L}}_{n_e,n_r}^{\text {Var,Smooth}}\right) = K(3n_r+n_e). \end{aligned}$$

Proof

For each sample point the variational formulation (3.7) only involves a gradient term and the function value and, proceeding as in the proof of Lemma 3.3.3, this produces 3K evaluations of \(u_\Theta \), where we again assume K iid Gaussian samples. Once more the boundary penalisation element shown in (3.8) produces the \(Kn_e\) term. \(\square \)

3.3.2 The \(p=2\) case

The \(p=2\) Laplace equation reduces to the Poisson equation, and the following holds:

Lemma 3.3.5

For the Poisson equation, the complexity metrics for the prototypical loss formulation (2.1) and the loss function using the variational form (3.8) read

$$\begin{aligned} \xi ({\mathcal {L}}_{n_e,n_r})&= n_r\left( \sum _{i=1}^d \left( 4\alpha ^2_i - \alpha _i \right) \right) + n_e, \\ \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right)&= n_r\left( 1 + 2\sum _{i=1}^{d} \alpha _i\right) + n_e. \end{aligned}$$

Proof

The proof for \(\xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right) \) is exactly the same as in the case \(p \ne 2\). The first statement's proof is very similar to that of Lemma 3.3.1; we include it for the sake of completeness. Using the same decomposition as in Lemma 3.3.1, we have

$$\begin{aligned} \frac{\partial ^2 u_\Theta }{\partial x_i^2} = \sum _{a \ne b} \frac{\partial ^2 g}{\partial \phi _b \partial \phi _a}\frac{\partial \phi _a}{\partial x_i}\frac{\partial \phi _b}{\partial x_i} + \sum _a \frac{\partial ^2 g}{\partial \phi _a^2}\frac{\partial \phi _a}{\partial x_i}\frac{\partial \phi _a}{\partial x_i} + \sum _a \frac{\partial g}{\partial \phi _a}\frac{\partial ^2 \phi _a}{\partial x_i^2}. \end{aligned}$$

For the first term we apply the basic parameter shift rule twice to the \(\alpha ^2_i - \alpha _i\) terms resulting in \(4(\alpha ^2_i - \alpha _i)\) evaluations of \(u_\Theta \). For the second term we apply the optimised parameter shift rule given in (2.6), producing \(3\alpha _i\) evaluations of \(u_\Theta \) and also providing all of the quantum derivatives needed for the third term. Summing over all i,

$$\begin{aligned} \sum _{i=1}^d \left( 4(\alpha ^2_i - \alpha _i) + 3\alpha _i\right) =\sum _{i=1}^d \left( 4\alpha ^2_i - \alpha _i \right) . \end{aligned}$$

We then add the boundary term and multiply by the relevant sample sizes to obtain the first statement. \(\square \)

Lemma 3.3.6

If the PDE trial solution is Gaussian smoothed, then

$$\begin{aligned} \xi \left( {\mathcal {L}}_{n_e,n_r}^{\text {Smooth}}\right) = K(3n_r + n_e). \end{aligned}$$
(3.15)

Proof

The Gaussian Laplacian term is given by [16, Equation (11)]

$$\begin{aligned} \Delta f_\Theta (x) = {\mathbb {E}}\left[ \left( \frac{\Vert \delta \Vert ^2-\sigma ^2 d}{2 \sigma ^4}\right) (u_\Theta (x+\delta )+u_\Theta (x-\delta )-2 u_\Theta (x))\right] . \end{aligned}$$
(3.16)

There are a total of three \(u_\Theta \) terms in (3.16), while the boundary term features a single \(f_\Theta \) evaluation for each sample point. Assuming we approximate each classical expectation with K iid Gaussian samples, the expression (3.15) follows. \(\square \)

3.4 Heat equation

We now consider the heat equation, namely the system

$$\begin{aligned} {\left\{ \begin{array}{ll} \left( \Delta - \partial _{t}\right) u(t,x) = 0, & \text {for all }x \in \Omega , t \in [0,T),\\ u(0,x) = h(x), & \text {for all } x \in \Omega ,\\ u(t,x) = g(x), & \text {for all }x \in \partial \Omega , t \in [0,T). \end{array}\right. } \end{aligned}$$

Similarly to above, we define \(\alpha _t\) to be the number of single-qubit rotation-like gates with time dependence.

Lemma 3.4.1

The complexity with the loss function (2.1) reads

$$\begin{aligned} \xi ({\mathcal {L}}_{n_e,n_r})&= n_r\left( 2\alpha _t+\sum _{i=1}^d \left( 4\alpha ^2_i - \alpha _i \right) \right) + n_e. \end{aligned}$$

If the network is Gaussian smoothed then

$$\begin{aligned} \xi \left( {\mathcal {L}}_{n_e,n_r}^{\text {Smooth}}\right) = K(5n_r+n_e). \end{aligned}$$
(3.17)

Proof

For the first statement we apply the same reasoning as in the proof for the p-Laplace case with \(p=2\), followed by the basic parameter shift rule for \(\partial _{t}u_\Theta \), which produces the \(2\alpha _t\) term. For the Gaussian statement we use the Laplacian term in (3.16), and the corresponding variance-reduced estimator reads

$$\begin{aligned} \partial _{t}f_\Theta ={\mathbb {E}}_{(\delta _t,\delta _x) \sim {\mathcal {N}}(0,\textrm{I}\sigma ^{2})} \left[ \frac{\delta _t}{2 \sigma ^2}\Big (u_\Theta (t + \delta _t,x+\delta _x)-u_\Theta (t - \delta _t,x-\delta _x)\Big )\right] \qquad \quad \end{aligned}$$
(3.18)

for the time derivative. Counting terms we arrive at (3.17). \(\square \)

Since the other components of the Gaussian noise in (3.18) yield the remaining first-order derivatives, we also have

$$\begin{aligned} (\partial _{t}f_\Theta ,\nabla _x f_\Theta )={\mathbb {E}}_{(\delta _t,\delta _x) \sim {\mathcal {N}}(0,\textrm{I}\sigma ^{2})}\left[ \frac{(\delta _t,\delta _x)}{2 \sigma ^2}(u_\Theta (t + \delta _t,x+\delta _x)-u_\Theta (t - \delta _t,x-\delta _x))\right] . \end{aligned}$$

3.5 Hamilton–Jacobi equation

Consider the classical linear-quadratic Gaussian (LQG) control problem in d dimensions, with the associated HJB equation given by [15]

$$\begin{aligned} {\left\{ \begin{array}{ll} \left( \partial _{t}u + \Delta u - \mu \Vert \nabla _x u\Vert ^2\right) (t,x) = 0, & \text {for all } x \in {\mathbb {R}}^{d} , t \in [0,T),\\ u(T,x) = h(x), & \text {for all }x\in {\mathbb {R}}^{d}. \end{array}\right. } \end{aligned}$$
(3.19)

with \(\mu \in {\mathbb {R}}\). Similarly to the p-Laplace case above, we have the following:

Lemma 3.5.1

In this HJB framework, the complexity with the standard loss function (2.1) reads

$$\begin{aligned} \xi ({\mathcal {L}}_{n_e,n_r})&= n_r\left( 2\alpha _t+ 3\sum _{i=1}^{d} \alpha _i(2\alpha _i -1)\right) + n_e, \end{aligned}$$
(3.20)

which simplifies, for a Gaussian smoothed network, to

$$\begin{aligned} \xi \left( {\mathcal {L}}_{n_e,n_r}^{\text {Smooth}}\right) = K(5n_r+n_e). \end{aligned}$$
(3.21)

Proof

We apply the same decomposition and parameter shift applications as in Lemma 3.3.5, noting that doing so provides all of the quantum expectations needed to calculate \(\nabla _x u_\Theta \) too. The basic parameter shift rule is applied to find \(\partial _{t}u_\Theta \); after multiplying each term by the number of relevant samples and adding the boundary element we recover (3.20). For the Gaussian expression we apply (3.18) and (3.16); counting terms, adding the boundary term and multiplying by the relevant sample sizes produces (3.21). \(\square \)

3.6 Explicit comparison

With the random network introduced in Sect. 3.1, assuming that the number n of qubits is a multiple of the problem dimension d, we then have, for each \(i, j \in \{1,\cdots ,d\}\),

$$\begin{aligned} \alpha _i = \frac{n}{d} + Mn \qquad \text {and}\qquad \alpha _{i,j} = Mn. \end{aligned}$$

For this particular class of networks, all second-order partial derivatives commute. For the p-Laplace equation with \(p \ne 2\) and without Gaussian smoothing, the complexities in Lemmas 3.3.1 and 3.3.2 read

$$\begin{aligned} \xi ({\mathcal {L}}_{n_e,n_r})&= n_r \sum _{i,j = 1, i \le j}^d \left( 4\left[ \frac{n}{d} + Mn\right] ^2 - Mn \right) + n_e\\&= \frac{n_r}{2}d(d+1)\left[ 4\left[ \frac{n}{d} + Mn\right] ^2 - Mn\right] + n_e,\\ \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right)&= n_r\left( 1 + 2\sum _{i=1}^{d} \alpha _i\right) + n_e = n_r\left( 1 + 2d\left( \frac{n}{d} + Mn\right) \right) + n_e. \end{aligned}$$

Let \({\mathcal {L}}^{\text {Var, Smooth}}_{n_e,n_r}\) represent the variational loss function (3.8) evaluated with the Gaussian-smoothed trial solution (2.2). In this case, we have

$$\begin{aligned} \xi ({\mathcal {L}}_{n_e,n_r}^{\text {Smooth}}) = K(5n_r+n_e) \qquad \text {and}\qquad \xi \left( {\mathcal {L}}^{\text {Var,Smooth}}_{n_e,n_r}\right) = K(3n_r+n_e). \end{aligned}$$

Fix two trainable UAT layers (\(M=2\)) and let the number of qubits be three times the problem dimension (\(n=3d\)). The numbers of evaluations then reduce to

$$\begin{aligned} \displaystyle \xi \left( {\mathcal {L}}_{n_e,n_r}\right)= & \displaystyle n_rd(d+1)\left( 2(3 + 6d)^2 - 3d\right) + n_e = \displaystyle n_r\left( 72d^4 + 141d^3 + 87d^2 + 18d \right) +n_e,\\ \displaystyle \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right)= & \displaystyle n_r\left( 1 + 2d(3 + 6d)\right) + n_e = \displaystyle n_r\left( 12d^2 + 6d + 1\right) + n_e. \end{aligned}$$

While still polynomial, the variational formulation clearly exhibits better scaling in the dimension; in particular \(\xi ({\mathcal {L}}_{n_e,n_r}) > \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right) \) for all values of \(d \in {\mathbb {N}}\). In [16], the authors find good experimental results using values of \(K\in [256,2048]\) for classical PINNs, so we now fix \(K = 1024\). For \(n_r = n_e\), we have

$$\begin{aligned} \begin{array}{rll} \displaystyle \xi \left( {\mathcal {L}}^{\text {Var, Smooth}}_{n_e,n_r}\right) & \displaystyle<\xi \left( {\mathcal {L}}_{n_e,n_r}^{\text { Smooth}}\right) , & \text {for all }d\in {\mathbb {N}},\\ \displaystyle \xi \left( {\mathcal {L}}^{\text {Var, Smooth}}_{n_e,n_r}\right) & \displaystyle< \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right) , & \text {if and only if }d \ge 19,\\ \displaystyle \xi \left( {\mathcal {L}}_{n_e,n_r}^{\text { Smooth}}\right) & \displaystyle < \xi \left( {\mathcal {L}}^{\text {Var}}_{n_e,n_r}\right) , & \text {if and only if }d \ge 23. \end{array} \end{aligned}$$

For the p-Laplace equation, Gaussian smoothing is thus more efficient as the dimension grows.
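The comparison above is elementary to reproduce: the short Python sketch below evaluates the complexity formulas of Lemmas 3.3.1 to 3.3.4 under the assumptions of this section (n = 3d, M = 2, K = 1024, n_r = n_e = 1) and prints the dimensions at which the smoothed formulations become cheaper.

```python
def complexities(d, M=2, K=1024, n_r=1, n_e=1):
    """Evaluation counts for the p-Laplace (p != 2) loss functions of Sect. 3.3."""
    n = 3 * d                               # three qubits per problem dimension
    alpha = n // d + M * n                  # alpha_i = n/d + M n
    alpha_ij = M * n                        # alpha_{i,j} = M n
    pinn = n_r * d * (d + 1) // 2 * (4 * alpha**2 - alpha_ij) + n_e   # Lemma 3.3.1
    var = n_r * (1 + 2 * d * alpha) + n_e                             # Lemma 3.3.2
    smooth = K * (5 * n_r + n_e)                                      # Lemma 3.3.3
    var_smooth = K * (3 * n_r + n_e)                                  # Lemma 3.3.4
    return pinn, var, smooth, var_smooth

# Locate the crossover dimensions (n_r = n_e).
for d in range(1, 30):
    pinn, var, smooth, var_smooth = complexities(d)
    if var_smooth < var or smooth < var:
        print(d, pinn, var, smooth, var_smooth, var_smooth < var, smooth < var)
```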

Remark 3.6.1

Using the complexity analysis in the previous sections, a similar comparison analysis is straightforward for the p-Laplace equation with \(p=2\) as well as for the HJB and the Heat equations.

3.7 Expressive power of smoothed networks

From Theorem 2.2.2, when the output of the quantum network is Gaussian smoothed, the resulting PDE trial solution is \(\frac{{\mathfrak {u}}}{\sigma }\sqrt{\frac{2}{\pi }}\)-Lipschitz with respect to the 2-norm, where \({\mathfrak {u}}= \sup _{x \in {\mathbb {R}}^d} |u_\Theta (x)|\). To derive an upper bound for \({\mathfrak {u}}\) valid for any parameter set, it suffices to consider the range of the expectation (3.4) of the Hermitian cost operator:

$$\begin{aligned} \sup _{x,\Theta } |u_\Theta (x)| = \sup _{x,\Theta } |\langle {\mathcal {C}}\rangle _{\texttt{U}_{\Theta }(x) |{\varvec{0}}\rangle }| \le \sup _{\psi , \langle \psi | \psi \rangle = 1}\left|\langle {\mathcal {C}}\rangle _{\psi }\right|, \end{aligned}$$

using the Ising Hamiltonian (3.3)

$$\begin{aligned} \sup _{\psi , \langle \psi | \psi \rangle = 1}\left|\langle {\mathcal {C}}\rangle _{\psi }\right|\le \sup _{\psi , \langle \psi | \psi \rangle = 1}\left|\left\langle \sum _{j=1}^n \left[ \widehat{\texttt{Z}}_{j (\textrm{mod}\, n) }\widehat{\texttt{Z}}_{j+1 (\textrm{mod}\, n)} + \widehat{\texttt{Z}}_{j (\textrm{mod}\, n)} + \widehat{\texttt{X}}_{j(\textrm{mod}\, n)} \right] \right\rangle _{\psi }\right|\le 3n, \end{aligned}$$

resulting in a Lipschitz constant of \(\frac{3n}{\sigma }\sqrt{\frac{2}{\pi }}\) for the PDE trial solution, which is unlikely to be the best possible Lipschitz constant. For example, for a single qubit it is easy to see that the expectation of \(\texttt{Z}\) cannot be 1 when the expectation of \(\texttt{X}\) is 1.
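This can be checked numerically for small n by building the Ising cost operator (3.3) explicitly and computing its largest expectation over unit states, i.e. its spectral radius; the numpy sketch below is only an illustration for a few qubit counts.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def op(single, site, n):
    """Single-qubit operator `single` acting on qubit `site` of an n-qubit register."""
    return reduce(np.kron, [single if k == site else I2 for k in range(n)])

def ising_cost(n):
    """Cost operator (3.3): sum_j Z_j Z_{j+1} + Z_j + X_j, indices taken mod n."""
    C = np.zeros((2**n, 2**n), dtype=complex)
    for j in range(n):
        C += op(Z, j, n) @ op(Z, (j + 1) % n, n) + op(Z, j, n) + op(X, j, n)
    return C

for n in range(2, 6):
    C = ising_cost(n)
    sup_expectation = np.max(np.abs(np.linalg.eigvalsh(C)))   # sup over unit states
    print(n, round(float(sup_expectation), 3), 3 * n)         # compare with the bound 3n
```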

3.8 Lipschitz constant for the heat equation

3.8.1 Heat equation with small Lipschitz constant

Consider the heat equation defined as

$$\begin{aligned} \left\{ \begin{array}{rll} \Delta u(t,x) & = \partial _{t}u(t,x), & \text {for all }x \in {\mathcal {B}}_{0,1}, t \in [0,T),\\ u(0,x) & = \displaystyle \frac{\Vert x\Vert ^2}{2d}, & \text {for all } x \in {\mathcal {B}}_{0,1},\\ u(t,x) & = \displaystyle t + \frac{1}{2d}, & \text {for all }x \in \partial {\mathcal {B}}_{0,1}, t \in [0,T). \end{array} \right. \end{aligned}$$

with \({\mathcal {B}}_{0,1}\) the unit ball in \({\mathbb {R}}^d\). The solution is \(u(t,x) = t + \frac{\Vert x\Vert ^2}{2d}\), whose Lipschitz constant satisfies

$$\begin{aligned} L^{\alpha }(u) = \sup _{x \in {\mathcal {B}}_{0,1}, t \in [0,T) } \left\| \nabla _{t,x} u (t,x)\right\| _{2} = \sup _{x \in {\mathcal {B}}_{0,1}} \left\| \left( 1, \frac{x}{d}\right) \right\| _{2} \le \frac{1}{d} + 1. \end{aligned}$$

In our formulation \(d \le n\), where n is the number of qubits and the data lies in \({\mathbb {R}}^d\). Therefore, the Lipschitz constant \(L^{\alpha }(u)\) is smaller than that of the PDE trial solution, \(\frac{3n}{\sigma }\sqrt{\frac{2}{\pi }}\), obtained in Sect. 3.7, as long as \(\sigma \le 3n\sqrt{\frac{2}{\pi }}\left( 1+\frac{1}{d}\right) ^{-1}\). Good experimental results with \(\sigma \le 0.1\) were reported in [16], which clearly satisfies this condition for all integers n and d.

3.8.2 Heat equation with large Lipschitz constant

For an example of a PDE for which this Gaussian method is intractable with quantum networks, consider the heat equation

$$\begin{aligned} \Delta u(t,x) = \partial _{t}u(t,x), \end{aligned}$$
(3.22)

for \(x \in [0,1]^d\) and \(t \in [0,T)\), with Dirichlet boundary conditions, and consider the solution given by

$$\begin{aligned} u(t,x) = d^{\frac{1}{d}} \textrm{e}^{-a^2\pi ^2 d t}\prod _{i=1}^{d} \sin (a\pi x_i), \end{aligned}$$
(3.23)

for \(x=(x_1, \ldots , x_d)\), and with \(\nabla _{t,x}\) the gradient with respect to both t and x, we have the Lipschitz constant lower bound

$$\begin{aligned} L^{\alpha }(u)&= \sup _{x \in [0,1]^d, t \in [0,T) } \left\| \nabla _{t,x}u({t,x}) \right\| _{2} \\&= \sup _{x \in [0,1]^d, t \in [0,T) }\bigg \Vert -\varvec{e}_1 a^2\pi ^2 d u(t,x) + a\pi d^{\frac{1}{d}}\textrm{e}^{-a^2\pi ^2 d t}\sum _{j=1}^d \varvec{e}_{j+1}\cos (a\pi x_j) \prod _{i=1,i \ne j}^{d} \sin (a\pi x_i) \bigg \Vert _2 \\&= \sup _{x \in [0,1]^d} \bigg \Vert -\varvec{e}_1 a^2\pi ^2d u(0,x) + a\pi d^{\frac{1}{d}}\sum _{j=1}^d \varvec{e}_{j+1}\cos (a\pi x_j) \prod _{i=1,i \ne j}^{d} \sin (a\pi x_i)\bigg \Vert _2\\&\ge \bigg \Vert -\varvec{e}_1 a^2\pi ^2du\left( 0,\frac{1}{2}\right) + a\pi d^{\frac{1}{d}}\sum _{j=1}^d \varvec{e}_{j+1}\cos \left( \frac{a\pi }{2}\right) \prod _{i=1,i \ne j}^{d} \sin \left( \frac{a\pi }{2}\right) \bigg \Vert _2\\&= \bigg \Vert -\varvec{e}_1 a^2\pi ^2d u\left( 0,\frac{1}{2}\right) + a\pi d^{\frac{1}{d}}\sum _{j=1}^d \varvec{e}_{j+1}\cos \left( \frac{a\pi }{2}\right) \ \sin \left( \frac{a\pi }{2}\right) ^{d-1}\bigg \Vert _2\\&=\sqrt{\bigg (a^2\pi ^2 du\left( 0,\frac{1}{2}\right) \bigg )^2 + d\left( a\pi \cos \left( \frac{a\pi }{2}\right) d^{\frac{1}{d}}\sin \left( \frac{a\pi }{2}\right) ^{d-1}\right) ^2}, \end{aligned}$$

where \(\varvec{e}_{j}\) denotes the unit vector in \({\mathbb {R}}^{d+1}\) with 1 in component j and 0 elsewhere, and \(u\left( 0,\frac{1}{2}\right) \) is shorthand for the value at \(x = \left( \frac{1}{2},\ldots ,\frac{1}{2}\right) \). With \(d = 50\) and \(a = 1\), the resulting Lipschitz constant is greater than 22360. Comparing this to the Lipschitz bound of Sect. 3.7, a value of \(\sigma = 0.1\) would require more than 900 qubits for the smoothed PDE trial solution to be expressive enough.
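Rearranging the Lipschitz requirement \(\frac{3n}{\sigma }\sqrt{\frac{2}{\pi }} \ge L^{\alpha }(u)\) from Sect. 3.7 makes the qubit requirement explicit:

$$\begin{aligned} n \ge \frac{\sigma L^{\alpha }(u)}{3}\sqrt{\frac{\pi }{2}} \approx \frac{0.1 \times 22360}{3} \times 1.2533 \approx 934 > 900. \end{aligned}$$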

4 Numerical examples

The quantum state simulation is performed using the Yao package for Julia [22]. We use the random quantum network introduced in Sect. 3.1 with a varying number of qubits, and compare it to the random classical network developed by Gonon [10]: let \(E_1\) be a \(t_5(0, \textrm{I}_d)\) (multivariate t-distribution) random variable and \(B_1\) a Student-t random variable with two degrees of freedom. At each training iteration we uniformly sample new points from \(\Omega \) and \(\partial \Omega \). We train solutions using both classical and quantum networks. Due to the random nature of the networks, we repeat each training process five times; all training statistics reported below are mean values.
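To illustrate the simulation workflow, the sketch below evaluates the network output \(\langle {\mathcal {C}}\rangle \) of (3.4) with the Ising cost (3.3) in Yao; the parameterised circuit here is a generic single layer chosen purely for illustration, not the random network of Sect. 3.1.

```julia
using Yao

# Ising-type cost operator  C = Σ_j (Z_j Z_{j+1} + Z_j + X_j)  with periodic indices, cf. (3.3)
ising_cost(n) = sum(kron(n, j => Z, mod1(j + 1, n) => Z) + put(n, j => Z) + put(n, j => X)
                    for j in 1:n)

# generic parameterised layer (illustration only): single-qubit Rz rotations, then a CNOT ring
rot_layer(n, θ) = chain([put(n, j => Rz(θ[j])) for j in 1:n])
ent_layer(n)    = chain([control(n, j, mod1(j + 1, n) => X) for j in 1:n])
circuit(n, θ)   = chain(rot_layer(n, θ), ent_layer(n))

n = 4
θ = 2π .* rand(n)
ψ = zero_state(n) |> circuit(n, θ)          # state U(θ)|0⟩ for one draw of parameters
output = real(expect(ising_cost(n), ψ))     # network output ⟨C⟩_ψ, bounded by 3n in absolute value
```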

4.1 Poisson equation

Consider the Poisson equation (3.5) with \(p=2\), \(\Omega = (0,1)^2\) and

$$\begin{aligned} f(x,y) = 26\cos (y)\cos (5x) -\frac{2y}{(1 + x)^3}, \end{aligned}$$

so that the explicit solution reads

$$\begin{aligned} u(x,y) = \cos (5x)\cos (y)+\frac{y}{1+x}+\sin (x+y)^2+\cos (x+y)^2. \end{aligned}$$
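As a quick consistency check (ours, not part of the original text), one can verify numerically that this pair satisfies \(-\Delta u = f\), the sign convention we assume for (3.5) with \(p=2\):

```julia
# Central finite-difference check that -Δu = f at a sample interior point of (0,1)².
u(x, y) = cos(5x) * cos(y) + y / (1 + x) + sin(x + y)^2 + cos(x + y)^2
f(x, y) = 26 * cos(y) * cos(5x) - 2 * y / (1 + x)^3

function laplacian(g, x, y; h = 1e-4)
    ((g(x + h, y) - 2 * g(x, y) + g(x - h, y)) +
     (g(x, y + h) - 2 * g(x, y) + g(x, y - h))) / h^2
end

x0, y0 = 0.3, 0.7
-laplacian(u, x0, y0), f(x0, y0)   # the two values agree up to finite-difference error
```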

Alongside quantum and classical networks, we also investigate the two loss functions \({\mathcal {L}}^{\text {Var}}\) and \({\mathcal {L}}\). To demonstrate the effectiveness of the Haar random operator, we also train solutions using \(\varvec{\Lambda } = \textrm{I}\). We use six qubits; detailed training information and network settings are shown in Appendix A. Final relative \(L_2\) errors of the trial solutions are shown in Table 1 alongside other metrics. Regardless of the loss function used, the full random quantum networks outperform all of the random classical networks. Table 1 also shows that the performance of the classical networks improves as the number of nodes increases; notably, the full random quantum network has 24 trainable parameters yet outperforms the classical network with more than four times as many trainable parameters. For both classical and quantum networks, the variational loss \({\mathcal {L}}^{\text {Var}}\) produces trial solutions with approximately the same final relative \(L_2\) error as \({\mathcal {L}}\).

The addition of the Haar random operator \(\varvec{\Lambda }\) has little impact on the training complexity since it is fixed; however, it greatly improves the final relative \(L_2\) error, for example reducing the error for the variational loss from 0.575 to 0.040. Figure 1 shows the squared error \(|u_\Theta -u|^2\) as a function of \(x_1\) for \(x_2 \in \{0.25,0.5,0.75\}\). The solid blue line indicates the average value for the quantum network, with the shaded blue area covering all error values from the five runs. For the classical networks we plot the network with the lowest overall error of the five. Figure 2 shows snapshots of the trial solution against the analytic solution. The five quantum networks display similar behaviour across runs on each interval.

Fig. 1

Squared error values \(|u_\Theta (x) -u(x)|^2\) of the trial solutions for the Poisson equation in \({\mathbb {R}}^2\) with \(x_1\) on the horizontal axis and for \(x_2 \in \{0.25,0.5,0.75\}\). The parameter N refers to the number of nodes in the random classical network and \(n=6\) is the number of qubits in the random quantum network. For the classical networks we plot the solution with the lowest overall mean squared error, whereas for the quantum networks we plot the average squared error with a ribbon indicating the range of error for the five different training runs

Fig. 2

Results of the quantum networks for the Poisson equation in \({\mathbb {R}}^2\), as a function of \(x_1\) for \(x_2\in \{0,0.5,1\}\). We also plot the analytic solution for comparison

Table 1 Training results for the random networks used to solve the Poisson equation with varying loss function formulations

4.2 p-Laplace with \(p=3\)

Consider the Poisson equation (3.5) with \(p=3\), \(\Omega = (0,1)^2\) and

$$\begin{aligned} f(x,y) = -\frac{\left( - 36 \cos ^{2}{\left( x \right) } \cos ^{2}{\left( 2 y \right) } - 6\sin ^{2}{\left( x \right) } \sin ^{2}{\left( 2 y \right) } +8 \sin ^{2}{\left( x \right) } \cos ^{2}{\left( 2 y \right) } \right) \sin {\left( 2 y \right) } \cos {\left( x \right) }}{\sqrt{\sin ^{2}{\left( x \right) } \sin ^{2}{\left( 2 y \right) } + 4 \cos ^{2}{\left( x \right) } \cos ^{2}{\left( 2 y \right) }}} \end{aligned}$$

so that the explicit solution reads

$$\begin{aligned} u(x,y) = \cos (x)\sin (2y). \end{aligned}$$
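Similarly (an illustrative check of ours, assuming the convention \(-\textrm{div}\left( |\nabla u|^{p-2}\nabla u\right) = f\) for the p-Laplace operator in (3.5)), the 3-Laplacian of this solution can be compared against f with nested finite differences:

```julia
using LinearAlgebra

u(v) = cos(v[1]) * sin(2 * v[2])
f(x, y) = -((-36 * cos(x)^2 * cos(2y)^2 - 6 * sin(x)^2 * sin(2y)^2 + 8 * sin(x)^2 * cos(2y)^2) *
            sin(2y) * cos(x)) / sqrt(sin(x)^2 * sin(2y)^2 + 4 * cos(x)^2 * cos(2y)^2)

const E = ([1.0, 0.0], [0.0, 1.0])                           # coordinate directions in ℝ²
grad(g, v; h = 1e-4) = [(g(v + h * e) - g(v - h * e)) / (2h) for e in E]
flux(v)              = norm(grad(u, v)) .* grad(u, v)         # |∇u|^{p-2} ∇u with p = 3
divflux(v; h = 1e-3) = sum((flux(v + h * e)[i] - flux(v - h * e)[i]) / (2h)
                           for (i, e) in enumerate(E))

v0 = [0.3, 0.7]
-divflux(v0), f(v0...)   # both values agree up to finite-difference error
```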

Alongside quantum and classical networks, we also investigate the two loss functions \({\mathcal {L}}^{\text {Var}}\) and \({\mathcal {L}}\). We use six qubits, with detailed training information and network settings shown in Appendix A. The quantum network has 18 trainable parameters. Training results are shown in Table 2: the quantum-network-based solutions are considerably more accurate than the classical solutions using either \({\mathcal {L}}^{\text {Var}}\) or \({\mathcal {L}}\). Note, however, that the classical networks’ loss function values were still decreasing at the end of training.

Figure 3 shows the squared error \(|u_\Theta -u|^2\) as a function of \(x_1\) for \(x_2 \in \{0.25,0.5,0.75\}\). The solid blue line indicates the average value for the quantum network, with the shaded blue area covering all error values from the five runs. For the classical networks we plot the network with the lowest overall error of the five. Figure 4 shows selected snapshots of the trial solution against the analytic solution.

Table 2 Training results for the random networks used to solve the p-Laplace equation for \(p=3\) with varying loss function formulations
Fig. 3

Squared error values \(|u_\Theta (x) -u(x)|^2\) of the trial solutions for the p-Laplace equation with \(p=3\) in \({\mathbb {R}}^2\) with \(x_1\) on the horizontal axis and for \(x_2 \in \{0.25,0.5,0.75\}\). The parameter N refers to the number of nodes in the random classical network and \(n=6\) is the number of qubits in the random quantum network. For the classical networks we plot the solution with the lowest overall mean squared error, whereas for the quantum networks we plot the average squared error with a ribbon indicating the range of error for the five different training runs

4.3 Heat equation

We consider the heat equation (3.22) with the solution (3.23), \(T = 1\) and \(a=0.25\). We solve for \(d=1,2\) using 4 and 6 qubits, respectively. Specific training settings are shown in Appendix B and detailed training results can be found in Tables 3 and 4. For \(d=1\), the quantum network has a lower average MSE than the classical networks; in particular, the classical network with the same number of trainable parameters has an MSE an order of magnitude larger. We also train classical networks with both more and fewer trainable parameters and find that the quantum networks outperform all of them. For \(d=2\), the quantum network has 60 trainable parameters and outperforms all the random classical networks with fewer than 70 parameters. The final MSE of the classical random networks improves markedly once they have more than 80 nodes; compared to the previous two examples, this could suggest that our quantum network lacks the required expressivity, or that more training iterations are needed for accurate quantum networks. In Fig. 5, we plot the squared error values \(|u_\Theta -u|^2\) as a function of \(x_1\) for \(x_2 \in \{0.25,0.5,0.75\}\). The solid blue line indicates the average value for the quantum network, with the shaded blue area covering all error values from the five runs. For the classical networks we plot the network with the lowest overall error of the five. In Fig. 6 we plot snapshots of the trial solution against the analytic solution.

Fig. 4

Results of the quantum networks for the p-Laplace equation with \(p=3\) in \({\mathbb {R}}^2\), as a function of \(x_1\) for \(x_2\in \{0,0.5,1\}\). We also plot the analytic solution for comparison

4.4 Hamilton–Jacobi equation

We use the specific HJB equation (3.19) with the unique solution [15]

$$\begin{aligned} u(t,x) = -\frac{1}{\mu } \log \left( (2\pi )^{-\frac{{d}}{2}}\int _{{\mathbb {R}}^{d}} \exp \left\{ -\frac{\Vert y\Vert ^2}{2} - \mu h\left( x-\sqrt{2(T-t)}y\right) \right\} \textrm{d}y\right) . \end{aligned}$$
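This expression can be evaluated by plain Monte Carlo sampling; a minimal sketch (ours, for illustration, not the evaluation code used in the paper), based on the representation above with a standard Gaussian \(Y \sim {\mathcal {N}}(0, \textrm{I}_d)\):

```julia
using LinearAlgebra

# Monte Carlo evaluation of u(t,x) = -(1/μ) log E[exp(-μ h(x - √(2(T-t)) Y))], Y ~ N(0, I_d).
h(x) = log((1 + norm(x)^2) / 2)

function u_reference(t, x; μ = 1.0, T = 1.0, nsamples = 10^6)
    d   = length(x)
    acc = 0.0
    for _ in 1:nsamples
        y = randn(d)
        acc += exp(-μ * h(x .- sqrt(2 * (T - t)) .* y))
    end
    return -log(acc / nsamples) / μ
end

u_reference(0.0, [0.5, -0.5])    # reference value at t = 0, x = (0.5, -0.5)
```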

Since the domain of x is \({\mathbb {R}}^{d}\), we require a different form of the encoding matrix; as a result, we change definition (3.2) to

$$\begin{aligned} \varvec{A}(x):= \bigotimes _{j = 1 }^{n} \texttt{R}_{z}\left[ {\mathcal {X}}_j\pi \tanh \left( x_{{\mathfrak {g}}_{j}}\right) \right] . \end{aligned}$$
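A minimal sketch of how this modified encoding can be assembled in Yao follows; the arrays χ and g below are placeholders for the quantities \({\mathcal {X}}_j\) and \({\mathfrak {g}}_{j}\) defined earlier, with illustrative values.

```julia
using Yao

# Tanh-rescaled encoding  A(x) = ⊗_j Rz(𝒳_j π tanh(x_{𝔤_j}))  for inputs on an unbounded domain.
function encoding(x::AbstractVector, χ::AbstractVector, g::AbstractVector{Int})
    n = length(χ)
    # tanh maps each unbounded coordinate into (-1, 1) before the Rz rotation;
    # single-qubit Rz gates on distinct qubits commute, so this chain equals the tensor product
    chain([put(n, j => Rz(χ[j] * π * tanh(x[g[j]]))) for j in 1:n])
end

tx = [0.2, 1.5, -0.8]        # example input (t, x₁, x₂)
χ  = rand(4)                 # illustrative scaling factors
g  = [1, 2, 3, 2]            # illustrative qubit-to-coordinate assignment
ψ0 = zero_state(4) |> encoding(tx, χ, g)
```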
Fig. 5

Squared error values \(|u_\Theta (x) -u(x)|^2\) of the trial solutions for the Heat equation with \(d=2\) and \(x_1\) on the horizontal axis and for \(x_2 \in \{0.25,0.5,0.75\}\). The parameter N refers to the number of nodes in the random classical network and \(n=4\) is the number of qubits in the random quantum network. For the classical networks we plot the solution with the lowest overall mean squared error, whereas for the quantum networks we plot the average squared error with a ribbon indicating the range of error for the five different training runs

Fig. 6

Solution snapshots of the quantum networks for the Heat equation with \(d=2\) and \(x_1\) on the horizontal axis and \(x_2 \in \{0,0.5,1\}\). We also plot the analytic solution for comparison. Note the y-axis for each plot is different

Table 3 Results for the random networks for the Heat equation with \(d=1\)
Table 4 Results for the random networks for the Heat equation with \(d=2\)

We use the activation function \(\tanh \) due to its performance in traditional machine learning applications. We solve the specific HJB equation (3.19) with \(d = 2\), \(\mu = 1\), \(T=1\), \(h(x) = \log \left( \frac{1+\Vert x\Vert ^2}{2}\right) \), using 9 qubits and a single trainable layer. Training results are shown in Table 5 and training settings in Appendix C. We see relatively poor performance for both sets of random networks, which is due to hardware limitations: the MSE did not plateau during training for any of the models, indicating that more training iterations should be used. Calculating the derivatives needed for the HJB equation with a 9-qubit network architecture requires considerably larger computational resources, which we leave to further study.

Table 5 Training results for the random networks used to solve the given HJB equation

5 Conclusion

This paper develops parameterised quantum circuits to solve widely used PDEs in any dimension. It further provides a precise complexity study of these algorithms and compares them to their classical counterparts, highlighting their potential advantages and limitations. In future work, we will investigate the impact of quantum noise on these quantum circuit-based PDE solvers, using both quantum noise models and actual quantum hardware.