Neural Error Mitigation


Neural Error Mitigation of Near-Term Quantum Simulations

Elizabeth R. Bennewitz,¹,²,³,∗ Florian Hopfmueller,¹,²,³,∗ Bohdan Kulchytskyy,¹ Juan Carrasquilla,⁴,² and Pooya Ronagh⁵,²,³,⁶,†

¹ 1QB Information Technologies (1QBit), Waterloo, ON, Canada
² Department of Physics & Astronomy, University of Waterloo, Waterloo, ON, Canada
³ Perimeter Institute for Theoretical Physics, Waterloo, ON, Canada
⁴ Vector Institute, MaRS Centre, Toronto, ON, Canada
⁵ Institute for Quantum Computing, University of Waterloo, Waterloo, ON, Canada
⁶ 1QBit, Vancouver, BC, Canada
(Dated: January 31, 2023)
arXiv:2105.08086v2 [quant-ph] 30 Jan 2023

Near-term quantum computers provide a promising platform for finding ground states of quantum systems, which is an essential task in physics, chemistry, and materials science. Near-term approaches, however, are constrained by the effects of noise as well as the limited resources of near-term quantum hardware. We introduce neural error mitigation, which uses neural networks to improve estimates of ground states and ground-state observables obtained using near-term quantum simulations. To demonstrate our method's broad applicability, we employ neural error mitigation to find the ground states of the H2 and LiH molecular Hamiltonians, as well as the lattice Schwinger model, prepared via the variational quantum eigensolver (VQE). Our results show that neural error mitigation improves numerical and experimental VQE computations to yield low energy errors, high fidelities, and accurate estimations of more-complex observables like order parameters and entanglement entropy, without requiring additional quantum resources. Furthermore, neural error mitigation is agnostic with respect to the quantum state preparation algorithm used, the quantum hardware it is implemented on, and the particular noise channel affecting the experiment, contributing to its versatility as a tool for quantum simulation.

∗ E. R. B. and F. H. have contributed equally to this work.
† Corresponding author: [email protected]

I. INTRODUCTION

Since the early twentieth century, scientists have been developing comprehensive theories that describe the behaviour of quantum mechanical systems. However, the computational cost required to study these systems often exceeds the capabilities of current scientific computing methods and hardware. Consequently, computational infeasibility remains a roadblock for the practical application of those theories to problems of scientific and technological importance.

The simulation of quantum systems on quantum computers, referred to in this paper as quantum simulation, shows promise toward overcoming these roadblocks, and has been a foundational driving force behind the conception and creation of quantum computers [1–4]. In particular, the quantum simulation of ground and steady states of quantum many-body systems beyond the capabilities of classical computers is expected to significantly impact nuclear physics, particle physics, quantum gravity, condensed matter physics, quantum chemistry, and materials science [5–8]. The capabilities of current and near-term quantum computers continue to be constrained by limitations, such as the number of qubits and the effects of noise. Quantum error correction (QEC) techniques can eliminate errors that result from noise, providing a path toward fault-tolerant quantum computation. However, in practice, implementing QEC imposes a large overhead in terms of both the required number of qubits and the required low error rates, both of which remain beyond the capabilities of current and near-term devices.

Before fault-tolerant quantum simulations [9] can be realized, modern variational algorithms significantly alleviate the demand on quantum hardware and exploit the capabilities of noisy intermediate-scale quantum (NISQ) devices [10, 11]. A prominent example is the variational quantum eigensolver (VQE) [12], a hybrid quantum–classical algorithm that iteratively approximates the lowest-energy eigenvalues of a target Hamiltonian through the variational optimization of a family of parameterized quantum circuits. This and other variational algorithms have emerged as a leading strategy toward achieving a quantum advantage using near-term devices and accelerating progress in multiple scientific and technological fields [13].

The experimental implementation of variational quantum algorithms remains a challenge for many scientific problems, as NISQ devices suffer from various sources of noise and imperfection. To alleviate these issues, several methods for quantum error mitigation (QEM) have been proposed and experimentally validated that improve quantum computations in the absence of the quantum resources required for QEC [14]. For a review of current QEM techniques, we refer the reader to Ref. [13] and the material cited therein. In general, these methods use specific information about the noise channels that affect a quantum computation, the hardware implementation, or the quantum algorithms themselves. Examples include the implicit characterization of noise models and how
they affect estimates of the desired observables, specific knowledge of the state subspaces in which the prepared quantum state ought to reside, and the characterization and mitigation of the sources of noise on individual components of the quantum computation such as single- and two-qubit gate errors, as well as state preparation and measurement (SPAM) errors.

Machine learning techniques, which have recently been repurposed as tools for tackling complex problems in quantum many-body physics and quantum information processing [15, 16], provide an alternative route to QEM. Here we introduce a QEM strategy named neural error mitigation (NEM), which uses neural networks to mitigate errors in the approximate preparation of the quantum ground state of a Hamiltonian.

The NEM algorithm, summarized in Fig. 1, is composed of two steps. First, we perform neural quantum state tomography (NQST) to train a neural quantum state (NQS) ansatz to represent the approximate ground state prepared by a noisy quantum device, using experimentally accessible measurements. Inspired by traditional quantum state tomography (QST), NQST is a data-driven machine learning approach to QST that uses a limited number of measurements to efficiently reconstruct complex quantum states [8]. We then apply the variational Monte Carlo algorithm (VMC) on the same neural quantum state ansatz (which we call the NEM ansatz) to improve the representation of the unknown ground state. In the spirit of VQE, VMC approximates the ground state of a Hamiltonian based on a classical variational ansatz [18], in this case an NQS ansatz.

[Figure 1: schematic with three columns. Quantum Simulation on Quantum Device: prepare a ground state $|\Psi_g\rangle$ from $|\Psi_0\rangle$ with a circuit $U$, then take projective measurements. Neural Quantum State Tomography: reconstruct $|\Psi_g\rangle$ with a neural quantum state $|\Psi_{\vec{\lambda}}\rangle$ from a measurement dataset $D$, then pass the trained $|\Psi_{\vec{\lambda}}\rangle$ to the VMC algorithm. Variational Monte Carlo: post-process $|\Psi_{\vec{\lambda}}\rangle$ using the variational principle $E_0 \approx \min E_{\vec{\lambda}} = \min \langle\Psi_{\vec{\lambda}}|\hat{H}|\Psi_{\vec{\lambda}}\rangle$, then return $|\Psi_{\vec{\lambda}}\rangle_{\mathrm{EM}}$.]

FIG. 1: Neural error mitigation procedure | First, an approximate ground state $|\Psi_g\rangle$ is prepared on a quantum computer from which simple projective measurements are taken (left column). This measurement dataset, $D$, is then used to reconstruct the final state $|\Psi_g\rangle$ with a neural quantum state $|\Psi_{\vec{\lambda}}\rangle$ using neural quantum state tomography (middle column). Then, the neural network ansatz is post-processed using variational Monte Carlo to mitigate errors in the ground-state representation (right column).

In this paper, we use an autoregressive generative neural network as our NEM ansatz. In particular, we use the Transformer [1] architecture, and show that this model performs well as a neural quantum state. Due to its capability to model long-range temporal and spatial correlations, this architecture has led to many state-of-the-art results in natural language and image processing, and has the potential to model long-range quantum correlations. We refer the reader to the Methods section and Supplementary Information for a complete description of NQS, NQST, VMC, and the Transformer neural network.

Neural error mitigation has several advantages compared to other error mitigation techniques. Firstly, it has a low experimental overhead; it requires only a set of simple experimentally feasible measurements to learn the properties of the noisy quantum state prepared by VQE. Consequently, the overhead of error mitigation in NEM is shifted from quantum resources (i.e., performing additional quantum experiments and measurements) to classical computing resources for machine learning. In particular, we note that the primary cost of NEM is in performing VMC until convergence. Another advantage of NEM is that it is agnostic with respect to the quantum simulation algorithm, the device it is implemented on, and the particular noise channel affecting the quantum simulation. As a result, it can also be combined with other QEM techniques [20, 21], and can be applied to either analog quantum simulation or digital quantum circuits [22, 23].

Neural error mitigation also alleviates the low measurement precision that arises when estimating quantum observables using near-term quantum devices. This is particularly important in quantum simulations, where making accurate estimations of quantum observables is essential for practical applications. Neural error mitigation intrinsically resolves the low measurement precision at each step of the algorithm. During the first step, NQST reduces the variance of observable estimates at the cost of introducing a small estimation bias [24]. This bias, as well as the remaining variance, is further reduced by training the NEM ansatz using VMC, which results in a zero-variance expectation value for energy estimates once the ground state has been reached [25].

II. RESULTS

A. Quantum Chemistry Results

Accurately simulating a molecule's electron correlations is an integral step in characterizing the chemical properties of the molecule. This problem, known as the electronic structure problem, involves finding the ground-state wavefunction and energy of many-body interacting fermionic molecular Hamiltonians. Achieving an absolute energy error corresponding to chemical accuracy (1 kcal/mol ≈ 0.0016 hartrees), the threshold for accurately estimating room-temperature chemical reaction rates, is essential to applications in drug discovery and materials science [7].

We demonstrate the application of NEM to the estimation of molecular ground states prepared using a VQE algorithm and show that our method improves the results up to chemical accuracy or better for the H2 and LiH
[Figure 2: nine panels (a–i), each plotted against bond length (Å). Columns: energy (hartrees), absolute energy error |ΔE| (hartrees), and infidelity. Rows: LiH VQE on IBMQ-Rome (a–c), H2 VQE (d–f), and LiH (g–i). Legend: Exact, VQE, NQST, NEM, and the chemical accuracy threshold.]

FIG. 2: Neural error mitigated experimental and numerical results for molecular Hamiltonians | Neural error mitigation results for energy, energy error, and infidelity for the ground states of LiH and H2 prepared using a hardware-efficient variational quantum circuit. Each panel contains information about the prepared quantum states (blue triangles), neural quantum states trained using neural quantum state tomography (green circles), the final neural error mitigated states (red diamonds), and, where appropriate, the exact results (solid black line). The top row (a–c) shows results for the LiH ground state prepared experimentally using IBM's five-qubit device, IBMQ-Rome. The middle row (d–f) and bottom row (g–i) show the performance of neural error mitigation for numerically prepared H2 and LiH ground states, respectively. Results are shown for the median performance over 10 noisy numerical simulations per bond length, and the shaded region is the interquartile range. For our ten data points, the interquartile range includes the middle six and excludes the best two and worst two in order to indicate the typical performance of the method. Error mitigated results extend both the experimentally and numerically prepared VQE states to chemical accuracy and low infidelities for all bond lengths of LiH and H2. Chemical accuracy is shown at 0.0016 hartrees (dashed black line).
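The summary statistics described in this caption can be illustrated with a short sketch. It assumes that the shaded band runs from the third-smallest to the eighth-smallest of the ten values (the "middle six"); the function name, the sample error values, and this exact quantile convention are illustrative assumptions rather than the authors' code or data.

```python
import numpy as np

# Threshold quoted in the caption (hartrees).
CHEMICAL_ACCURACY_HARTREE = 0.0016


def summarize_runs(abs_errors):
    """Return (median, band_low, band_high) for exactly ten absolute errors.

    The band keeps the middle six sorted values and drops the best two
    and worst two, as described in the caption (assumed convention).
    """
    ordered = np.sort(np.asarray(abs_errors, dtype=float))
    if ordered.size != 10:
        raise ValueError("expected exactly ten runs")
    median = 0.5 * (ordered[4] + ordered[5])  # mean of 5th and 6th values
    band_low, band_high = ordered[2], ordered[7]  # 3rd and 8th values
    return median, band_low, band_high


# Hypothetical absolute energy errors (hartrees) from ten noisy runs.
errors = [3e-4, 9e-4, 2e-3, 5e-4, 1e-4, 7e-4, 4e-4, 6e-3, 8e-4, 6e-4]
median, low, high = summarize_runs(errors)
within = median <= CHEMICAL_ACCURACY_HARTREE
print(f"median={median:.2e}, band=[{low:.2e}, {high:.2e}], "
      f"within chemical accuracy: {within}")
```

Reporting the median with this band, rather than a mean with a standard deviation, keeps a single badly converged run from dominating the plotted summary on a logarithmic axis.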

molecules (see Fig. 2 for experimental and numerical results). We map the H2 and LiH molecular Hamiltonians computed in the STO-3G basis to qubit Hamiltonians with N = 2 and N = 4 qubits, respectively [7]. The quantum state is prepared using the hardware-efficient variational quantum circuit composed of single-qubit Euler rotation gates and two-qubit CNOT entangling gates native to superconducting hardware [7]. For both H2 and LiH, we construct a variational circuit with a single entangling layer, giving 10 and 20 variational parameters, respectively. More details can be found in the Methods section.

We highlight the performance of NEM on the experimental preparation of the ground states of LiH at different bond lengths using IBM's five-qubit chip, IBMQ-Rome. We map the four-qubit LiH problem to the four linearly connected qubits on IBMQ-Rome that have the lowest average single- and two-qubit gate errors. During optimization, we perform 250 iterations of simultaneous perturbation stochastic approximation (SPSA) optimization to obtain the final prepared quantum state. Neural error mitigation improves the results of VQE to chemical accuracy or better for all bond lengths, and achieves infidelities, given by $1 - |\langle\Psi|\Psi_0\rangle|^2$, of $10^{-3}$ for most bond lengths (shown in the top row of Fig. 2). On average, NEM achieves an improvement of three orders of magnitude on energy estimation and two orders of magnitude on infidelity. We provide further details about the reconstruction quality in the Supplementary Information, including an analysis of the reconstructed neural quantum state.

Additionally, we illustrate the results of applying NEM on the ground states of H2 and LiH prepared using classically simulated VQE with a depolarizing noise channel (shown in the bottom two rows of Fig. 2). We simulate VQE with a single-qubit depolarizing error probability of 0.001 and a two-qubit depolarizing error probability of 0.01. At each bond length, we generate 10 VQE simulations and report the NEM results. Notably, the median performance of NEM improves the ground-state estimation of H2 and LiH to chemical accuracy and low infi-
[Figure 3: six panels. Left, plotted against mass: (a) energy, (b) order parameter, (c) entanglement entropy, (d) infidelity, with legend Exact, VQE, NQST, NEM. Right, plotted against system size: (e) energy error per site ΔE/N and (f) infidelity, with legend VQE, NQST, and NEM at batch sizes $b = 2^{8}, 2^{10}, 2^{12}, 2^{14}$.]

FIG. 3: Performance of neural error mitigation applied to ground states of the lattice Schwinger model | Left: Estimates for (a) the ground state energy, (b) order parameter, (c) entanglement entropy between the first three and remaining five sites, and (d) infidelity to the exact ground state, for the N = 8 site model. Each panel contains results for the quantum states prepared using VQE simulated with a depolarizing noise channel (blue triangles), neural quantum states trained using neural quantum state tomography (green circles), the final neural error mitigated neural quantum states (red diamonds), and, where applicable, exact results (solid black lines). While the qualitative behaviour of the entanglement entropy and the order parameter across the phase transition are not modelled well by VQE or NQST, applying NEM consistently improves the estimates of all observables to low errors and low infidelities. Right: Results of the scaling study of NEM at the phase transition (m = −0.7) are shown for (e) the energy error per site and (f) infidelity, as a function of system size. The performance of VQE without noise (blue triangles) shows an approximately constant energy error per site with infidelities that become slightly worse as system size increases. Across all sizes, applying NEM (red diamonds) improves VQE performance by two to four orders of magnitude, even when using a small VMC batch size of $b = 2^{8}$, which is the number of samples used to estimate the energy's gradient in one iteration of VMC. In all panels (left and right), median values over 10 runs are shown, and the shaded region is the interquartile range.
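Two of the quantities plotted in this figure, the infidelity to a reference state and the entanglement entropy across the cut between the first three and the remaining five sites, can be computed from state vectors as in the sketch below. It assumes the second Rényi entropy $S_2 = -\ln \mathrm{Tr}\,\rho_A^2$ with a natural logarithm, and a qubit ordering in which the first site is the most significant bit of the basis-state index. The function names and the toy states are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def infidelity(psi, phi):
    """Return 1 - |<psi|phi>|^2 for two normalized state vectors."""
    return 1.0 - abs(np.vdot(psi, phi)) ** 2


def renyi2_entropy(psi, n_left, n_total):
    """Second Renyi entropy -ln Tr(rho_A^2) of the first n_left qubits."""
    # Rows index subsystem A (first n_left qubits), columns the remainder.
    mat = np.asarray(psi).reshape(2 ** n_left, 2 ** (n_total - n_left))
    rho_a = mat @ mat.conj().T  # reduced density matrix of A
    purity = np.real(np.trace(rho_a @ rho_a))
    return -np.log(purity)


n = 8  # number of sites, as in panels (a) through (d)

# Toy state 1: the product state |00000000>.
product = np.zeros(2 ** n, dtype=complex)
product[0] = 1.0

# Toy state 2: a Bell pair on qubits 2 and 3 (zero-indexed), which
# straddles the 3|5 cut; all other qubits are in |0>.
bell = np.zeros(2 ** n, dtype=complex)
index_11 = (1 << (n - 1 - 2)) | (1 << (n - 1 - 3))
bell[0] = bell[index_11] = 1.0 / np.sqrt(2.0)

s2_product = renyi2_entropy(product, 3, n)
s2_bell = renyi2_entropy(bell, 3, n)
infid = infidelity(bell, product)
print(s2_product, s2_bell, infid)
```

The reshape-based partial trace costs memory exponential in the number of sites, so it is only a check for small systems such as N = 8 and not a substitute for sample-based estimators.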

delities for all bond lengths. The increased infidelity for bond lengths larger than 2 Å, as shown in the infidelity plots of Fig. 2, can be explained by the decreasing energy gap between the ground state and the first excited state. When the energy gap is small, it becomes more difficult for methods that optimize the energy, like VQE and VMC, to isolate the ground-state representation.

B. Lattice Schwinger Results

We next apply our method to the ground state of the lattice Schwinger model, which is a prototypical abelian lattice gauge theory, and a toy model for quantum electrodynamics in one spatial dimension. Multiple experiments have been proposed that use quantum devices to explore the properties of this model [27–30]. In this paper, we consider the experiment where a trapped-ion analog quantum simulator is used to variationally prepare the ground state of the lattice Schwinger model using alternating entangling operations, $e^{it\hat{H}_E}$, which act on all qubits simultaneously, and single qubit rotations [27].

After using a Jordan–Wigner transformation to map the fermionic degrees of freedom of the theory to qubits, the lattice Schwinger Hamiltonian takes the following form [27]:

$$\hat{H} = \frac{w}{2}\sum_{j=1}^{N-1}\left(\hat{X}_j\hat{X}_{j+1} + \hat{Y}_j\hat{Y}_{j+1}\right) + \frac{m}{2}\sum_{j=1}^{N}(-1)^j\hat{Z}_j + \bar{g}\sum_{j=1}^{N}\hat{L}_j^2. \tag{1}$$

The first term describes the creation and annihilation of electron–positron pairs and contains an overall energy scale, $w$. The second term contains the bare electron mass $m$, and the third term contains $\bar{g}$, which is the coupling strength to the electric field $\hat{L}_j$. Solving for the electric field in one spatial dimension gives $\hat{L}_j = \epsilon_0 - \frac{1}{2}\sum_{\ell=1}^{j}\left(\hat{Z}_\ell + (-1)^\ell \hat{I}\right)$, where $\epsilon_0$ is an integration constant. Given that the quantum fields at one spatial lattice point are encoded into a pair of qubits, the total number of sites $N$ must be even. We set $w = 1$, $\bar{g} = 1$, and $\epsilon_0 = 0$ such that the only remaining parameter is the mass $m$. The ground state of the system for $m \to +\infty$ describes a vacuum with no electron–positron pairs and for $m \to -\infty$ describes a large number of electron–positron pairs. In the thermodynamic limit, the model exhibits a second-order phase transition at $m \approx -0.7$, which can be detected using the order parameter $\langle O \rangle = \frac{1}{2N(N-1)}\sum_{i,j>i}\langle (1 + (-1)^i \hat{Z}_i)(1 + (-1)^j \hat{Z}_j)\rangle$. The model possesses discrete symmetries, which inform
the choice of a variational quantum circuit with a manageable number of parameters, but, in order to demonstrate the general applicability of NEM, we do not enforce these symmetries on the neural quantum state.

We demonstrate the performance of neural error mitigation by applying it to the approximate ground state of the lattice Schwinger model obtained by numerically simulating a VQE algorithm for N = 8 sites, with single-qubit depolarizing noise with probability λ = 0.001 applied after each rotation and entangling operation. As shown in Fig. 3a through Fig. 3d, the simple VQE scheme we employ exhibits median infidelities between 0.10 and 0.31, with worse performance closer to the phase transition around m = −0.7. While the qualitative behaviour of the ground-state energy as a function of the mass is modelled approximately by VQE, the qualitative behaviour of other physical properties is not reproduced well, limiting the utility of our VQE results for studying the phase transition. This includes the order parameter, and the Rényi entanglement entropy $S_2$ of a partition of the system, which is a broadly used, experimentally accessible quantity [31, 32] expressing the amount of correlation present in the quantum state.

The properties of the NEM state show a substantial improvement over VQE. The NEM state reaches absolute energy errors on the order of $10^{-2}$, and infidelities approaching $10^{-3}$. Importantly, after applying NEM, the physical properties estimated by the state accurately follow their exact values. The ability to obtain precise estimations of these physical properties can be explained by the accurate representation of the ground-state wavefunction captured by the NEM neural quantum state. Further details about the reconstruction quality of each component are covered in the Supplementary Information, including a thorough analysis of the NEM neural quantum state.

To gather evidence that the performance of NEM scales well to larger near-term experiments on quantum devices, we study the behaviour of NEM as a function of system size for the lattice Schwinger model. For computational efficiency, the scaling study uses a modified VQE implementation without noise (see Fig. 3e and Fig. 3f) as compared to the simulated trapped-ion experiment (see Fig. 3a through Fig. 3d). For more details on the modified circuit, we refer the reader to the Methods section. The VQE algorithm is simulated on a classical computer for system sizes up to N = 16, and NEM is applied to the resulting states. The results in Fig. 3e and Fig. 3f show that NEM improves upon the VQE results by two to four orders of magnitude, even for large system sizes, using modest classical resources for the estimation of energy gradients in VMC.

III. DISCUSSION

The error mitigation strategy developed here demonstrates significant improvements to the estimations of ground states and ground-state observables obtained from two example classes of near-term quantum simulations, independent of the quantum device and noise channels. Additionally, we show that NEM exhibits the potential to scale well for such quantum experiments. Given its low quantum overhead, NEM can be a powerful asset for the error mitigation of near-term quantum simulations.

An advantage of using techniques based on neural quantum states for the task of quantum error mitigation is the ability to approximate complex wavefunctions from simple experimental measurements. In the process of improving the energy estimation performed by VQE, NEM reconstructs and improves the ground-state wavefunction itself as a neural quantum state. The accurate final representation of the ground-state wavefunction is the reason why NEM is able to accurately reconstruct and improve estimations of complex observables like energy, order parameters, and entanglement entropy without requiring additional quantum resources.

By combining VQE, which uses a parametric quantum circuit as an ansatz, and NQST and VMC, which use neural networks as an ansatz, NEM brings together two families of parametric quantum states and three optimization problems over their loss landscapes [33–35]. Our work raises the question as to the nature of the relationships between these families of states, their loss landscapes, and quantum advantage. Examining these relationships offers a new way to investigate the potential of NISQ algorithms in seeking a quantum advantage. This may lead to a better delineation between classically tractable simulations of quantum systems and those that require quantum resources.

IV. METHODS

A. Neural Quantum State

Our neural quantum state is based on the Transformer [1], an architecture originally developed to process sequences that have temporal and spatial correlations, such as written languages. Compared to previous architectures for sequence models such as the long short-term memory (LSTM) neural network [36], the Transformer excels at modelling long-range correlations, and has thus become very popular in machine learning. Within the quantum many-body machine learning community, considerable work has used autoregressive neural networks as neural quantum states [37, 38]. Recently, the Transformer has been adapted as an autoregressive generative neural quantum state [39].

We represent the quantum state $|\psi\rangle$ with a Transformer neural network that takes as input a bitstring $s = (s_1, \ldots, s_N) \in \{0, 1\}^N$, describing a computational basis state $|s\rangle$, where $N$ is the number of qubits. The neural network outputs two numbers $(p_{\vec{\lambda}}(s), \varphi_{\vec{\lambda}}(s))$ parameterized by the neural network weights $\vec{\lambda}$, which form
the complex amplitude $\langle s|\psi_{\vec{\lambda}}\rangle$ given by

$$\langle s|\psi_{\vec{\lambda}}\rangle = \sqrt{p_{\vec{\lambda}}(s)}\, e^{i\varphi_{\vec{\lambda}}(s)}. \tag{2}$$

Here, $p_{\vec{\lambda}}(s)$ is a normalized probability distribution, which automatically normalizes the quantum state. The autoregressive property of the model allows for efficient sampling from the Born distribution of $|\psi_{\vec{\lambda}}\rangle$ in the computational basis. More details may be found in the Supplementary Information.

The exact ground state amplitudes of both the quantum chemistry models and the lattice Schwinger model are real, that is, $\varphi(s) \in \{0, \pi\}$, and the signs of the lattice Schwinger model ground-state amplitudes follow a simple sign rule. However, to show the general applicability of our method, we do not enforce any of these conditions in our neural quantum states.

B. Neural Quantum State Tomography

In neural quantum state tomography [8], a neural network is trained to represent the state of a quantum device using samples from that state in various Pauli bases (i.e., after performing various post-rotations). Neural quantum state tomography proceeds by iteratively adjusting the NQS parameters to maximize the likelihood that the NQS assigns to the samples.

A sample $s \in \{0, 1\}^N$ in a Pauli basis $B = P_1 P_2 \cdots P_N$, with $P_i \in \{X, Y, Z\}$, labels the unique simultaneous eigenstate of the single qubit Pauli operators $P_i$, with eigenvalues determined by the entries of $s$. We denote such a state by $|s, B\rangle$. The likelihood of the sample $(s, B)$ according to the neural quantum state $|\psi_{\vec{\lambda}}\rangle$ is given by

$$p_{\vec{\lambda}}(s, B) = \left|\langle s, B|\psi_{\vec{\lambda}}\rangle\right|^2 = \Bigg|\sum_{\substack{t \in \{0,1\}^N \\ \langle s,B|t\rangle \neq 0}} \langle s, B|t\rangle \langle t|\psi_{\vec{\lambda}}\rangle\Bigg|^2. \tag{3}$$

Here, we sum over the computational basis states $|t\rangle$ that have a non-zero overlap with the given sample $|s, B\rangle$. For a single sample $|s, B\rangle$, the number of these states $|t\rangle$ is $2^K$, where $K$ is the number of positions $i$ where $P_i \neq Z$. Therefore, the computational cost of a single iteration of tomography training is proportional to $2^K$. To constrain this computational cost, we use projective measurements in almost-diagonal Pauli bases (i.e., Pauli bases $B$ with low numbers of $X$ or $Y$ terms).

In order to learn the quantum state from a set of measurements $D$, the objective function minimized during NQST is an approximation of the cross entropy averaged over the set of bases $\mathcal{B}$ from which samples were drawn [24], and is given by

$$L_{\vec{\lambda}} = -\frac{1}{|\mathcal{B}|}\sum_{B \in \mathcal{B}}\sum_{s \in \{0,1\}^N} p_{\mathrm{VQE}}(s, B)\, \ln p_{\vec{\lambda}}(s, B). \tag{4}$$

Here, $p_{\mathrm{VQE}}(s, B)$ is the exact, unknown likelihood of measuring $|s, B\rangle$ from the VQE state. The cross entropy achieves its minimum in $\vec{\lambda}$ if $p_{\mathrm{VQE}}(s, B) = p_{\vec{\lambda}}(s, B)$. As commonly done in unsupervised learning, the cross entropy is approximated using the set $D$ of the measured samples $|s, B\rangle$, which is further partitioned into training and validation subsets $D_{T,V}$. The loss function used in training is

$$L_{\vec{\lambda}} \approx -\frac{1}{|D_T|}\sum_{|s,B\rangle \in D_T} \ln p_{\vec{\lambda}}(s, B). \tag{5}$$

The training is performed using stochastic gradient descent (SGD) with the Adam [40] optimizer.

C. Variational Monte Carlo and Regularization

Variational Monte Carlo is a method that adjusts the parameters of a (classical) variational wavefunction ansatz in order to approximate the ground state of a given Hamiltonian. The method usually proceeds by gradient-based optimization of the energy, where the energy and its partial derivatives with respect to the ansatz parameters are estimated using Monte Carlo samples drawn from the classical variational wavefunction. As detailed in the Supplementary Information, the autoregressive property of our neural network wavefunction allows for efficient sampling of the learned probability distribution. This leads to more efficient VMC training compared to models such as restricted Boltzmann machines (RBM) [10, 41], where samples have to be obtained using Markov chain Monte Carlo.

Many implementations of VMC use second-order methods involving the Hessian of the energy [43], or other update rules such as stochastic reconfiguration [41], to update the parameters. These methods tend to be computationally expensive for models with large numbers of parameters. Instead, we use SGD via the Adam optimizer, leading to an update-step cost that is linear in the number of parameters, and hence scales more favourably for larger models.

We find it necessary to add a regularization term to the VMC objective in the early stages of VMC optimization. Without it, the magnitudes of the amplitudes of some computational basis states decrease to almost zero over the course of training, even for computational basis states which have a non-zero overlap with the true ground state. It has been previously noted [10] that the VMC algorithm has difficulty finding the ground state of molecular Hamiltonians because they have sharply peaked amplitudes in a sparse subset of the computational basis states. The regularization term is designed to increase very small amplitudes. It maximizes the L1 norm of the state, thereby discouraging sparsity. This is in contrast to the common usage of L1 regularization in machine learning and optimization where the L1
norm is minimized to encourage sparsity in sparse optimization tasks. We expect this technique to be useful for systems where the ground state has a large overlap with one or a few computational basis states. For example, ground states of electronic structure problems have a large overlap with the Hartree–Fock state and the lattice Schwinger ground states have a large overlap with the ground states for $m \to \pm\infty$. This regularization technique allows the NQS to capture the subdominant amplitudes, rather than collapsing onto the dominant computational basis state early in the training process. The L1 norm can be estimated in a tractable manner because we use an autoregressive, generative neural network as our neural quantum state, which allows for the exact sampling of the learned probability distribution and is automatically normalized. More details on the regularization term and the VMC algorithm can be found in the Supplementary Information.

[Figure 4: (a) variational circuit acting on the input state $|\Psi_0\rangle$, with layers of single-qubit rotations $R_z(\theta)$ (angles on the lower qubits mirrored with opposite sign) alternating with global entangling operations $e^{it_k\hat{H}_E}$, followed by projective measurements; (b) the operation $O_6(t)$, built from two-qubit gates $e^{\frac{1}{2}it(XX+YY)}$.]

FIG. 4: VQE ansatz circuits for the lattice Schwinger model | (a) Variational quantum circuit used to prepare the approximate ground state of the lattice Schwinger model, using VQE simulated classically. The input state $|\Psi_0\rangle$ is $|01\cdots01\rangle$ ($|10\cdots10\rangle$) for $m \geq -0.7$ ($m < -0.7$). (b) For the results shown in Fig. 3e and Fig. 3f, the entangling layers of (a) are replaced with $O_N$ for $N$ sites. The layers of single-qubit gates are left unchanged.

use this circuit for both our numerical simulations with a depolarizing noise channel, as well as to perform experiments on a five-qubit superconducting quantum processor (Fig. 2).

The H2 and LiH molecular Hamiltonians are mapped to qubit Hamiltonians with N = 2 and N = 4 qubits, respectively [7]. In particular, we map the second-quantized fermionic Hamiltonian for H2 to its qubit Hamiltonian using the Bravyi–Kitaev transformation [44] while the LiH Hamiltonian is transformed using the parity transformation [44, 45]. In each case, two qubits associated with the spin-parity symmetries of the model are removed to obtain final qubit Hamiltonians [46].

The hardware-efficient variational quantum circuit is composed of single-qubit rotations and two-qubit entangling gates native to superconducting hardware. The variational circuit,

$$|\Psi(\vec{\theta})\rangle = \prod_{l=1}^{d}\left(\prod_{q=1}^{N}\left[U^{q,l}(\vec{\theta})\right] \times U_{\mathrm{ENT}}\right) \times \prod_{q=1}^{N}\left[U^{q,0}(\vec{\theta})\right] |00\cdots0\rangle, \tag{6}$$

for $N$ qubits consists of $d$ CNOT entangling layers alternating with $N(d+1)$ single-qubit Euler rotations, given by $U(\vec{\theta}) = R_Z(\theta_1) R_X(\theta_2) R_Z(\theta_3)$. In the first rotation layer, $U^{q,0}(\vec{\theta})$, the first set of Z rotations is not implemented, reducing the number of circuit parameters. Within each entangling layer, we apply CNOT gates on pairs of linearly connected qubits. The variational circuit has $p = N(3d + 2)$ independent parameters. For both H2 and LiH, we construct a variational circuit with $d = 1$ entangling layers giving 10 and 20 parameters, respectively.

The variational circuit is then optimized using Qiskit's implementation of simultaneous perturbation stochastic approximation (SPSA) [47] for 250 iterations to obtain an estimation for the ground-state energy of H2 and LiH. Each SPSA iteration requires two energy evaluations. In order to reduce the sampling overhead during the energy estimations, Pauli terms in each Hamiltonian are grouped according to their common tensor product basis [7], requiring only two and 25 circuits with unique post-rotations for H2 and LiH, respectively, to estimate
the energy.
D. The VQE Implementation for Quantum In order to perform NQST, we collect almost-diagonal
Chemistry measurement samples from the final noisy state prepared
by the variational procedure. In this case, the nearly di-
We use the variational quantum circuit in Eq. (6) for agonal samples are taken in the following bases: in the
the electronic structure problem in quantum chemistry. all-Z basis, in the N bases with one X (and Z elsewhere),
This ansatz was designed for the hardware capabilities and in the (N (N − 1)/2) bases with two Xs (and Z else-
of current superconducting quantum processors [7]. We where).
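The two counting statements in this subsection, the parameter count p = N(3d + 2) and the list of almost-diagonal measurement bases, can be checked with a short sketch. The function names and the string encoding of a Pauli basis (one letter per qubit) are illustrative assumptions and are not taken from the authors' codebase.

```python
from itertools import combinations


def num_circuit_parameters(n_qubits: int, depth: int) -> int:
    """p = N(3d + 2): three Euler angles per qubit in each of the d + 1
    rotation layers, minus the N leading Z rotations omitted in layer 0."""
    return n_qubits * (3 * depth + 2)


def almost_diagonal_bases(n_qubits: int) -> list[str]:
    """All-Z basis, the N bases with one X, and the N(N-1)/2 bases with two Xs."""
    bases = ["Z" * n_qubits]
    for num_x in (1, 2):
        for positions in combinations(range(n_qubits), num_x):
            label = ["Z"] * n_qubits
            for q in positions:
                label[q] = "X"  # measure qubit q in the X basis
            bases.append("".join(label))
    return bases
```

For N = 4 and d = 1 this gives 20 parameters and 1 + 4 + 6 = 11 bases, in line with the counts quoted above for LiH.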
E. The VQE Implementation for the Lattice Schwinger Model

Our variational quantum circuit for the lattice Schwinger model closely follows the variational circuit implementation on a trapped-ion analog quantum simulator [27] that approximately preserves the symmetries of the lattice Schwinger model.
The quantum state is first prepared in |01···01⟩ for m ≥ −0.7 and |10···10⟩ for m < −0.7, coinciding with the ground states of the Schwinger Hamiltonian Eq. (1) for m → ±∞. On this initial state, three alternating layers of evolution with an entangling Hamiltonian, followed by Z rotations on each qubit, are applied. The entangling Hamiltonian contains long-range XX couplings and a uniform effective magnetic field, and is given by

\[
\hat{H}_E = J \sum_{j=1}^{N-1} \sum_{k=j+1}^{N} \frac{1}{|j - k|^{\alpha}} \hat{X}_j \hat{X}_k + B \sum_{j=1}^{N} \hat{Z}_j. \tag{7}
\]

We choose α = 1, J = 1, and B = 10 to approximate the trapped-ion experimental setup [27]. Evolution with this Hamiltonian preserves the symmetries of the lattice Schwinger model to first order terms in J/B.
Only half of the parameters in each single-qubit rotation layer are independent, as required by the symmetries, giving ϕ_j = −ϕ_{N+1−j} for j ∈ {N/2 + 1, . . . , N}. In total, the variational circuit has 15 independent parameters for N = 8 lattice sites, which are initialized to zero at the start of each optimization. As a simple noise model, after each entangling layer and each single-qubit rotation layer, a depolarizing channel with λ = 0.001 is applied to each qubit (Extended Data Fig. 1a).
To optimize the variational parameters, the energy is estimated by taking samples in each of the three bases Z^{⊗N}, X^{⊗N}, and Y^{⊗N}. The SPSA hyperparameter values are chosen by inspecting the variance and approximate gradient at the beginning of the optimization [48]. The exact values are listed in the Supplementary Information.
As input to NQST, almost-diagonal samples are taken in each of the following 2N − 1 Pauli bases: the all-Z basis, the (N − 1) bases with XX at a pair of neighbouring sites (with Z elsewhere), and the (N − 1) bases with YY at a pair of neighbouring sites (with Z elsewhere). Note that for the Hamiltonian given by Eq. (1), the samples provide an estimation of the energy.
The results of the scaling study shown in Fig. 3e and Fig. 3f use a modified VQE implementation for computational efficiency. Instead of evolving the circuit using the entangling Hamiltonian, we use an entangling layer O_N comprising two layers of two-qubit gates simulated without noise, as depicted in Extended Data Fig. 1b. The entangling layer was chosen to exactly preserve the symmetries of the lattice Schwinger model, while being easier to simulate numerically than evolution with Ĥ_E. Note that, since it is composed of nearest-neighbour gates, it is also suited for superconducting quantum hardware, especially to capacitively coupled, flux-tunable transmon qubits, where the interaction XX + YY can be easily implemented [49]. For the scaling study, 1024 samples are taken in each basis to estimate the energy during SPSA optimization. The hyperparameter A of SPSA is increased to 20 and the other parameters are left unchanged.

ACKNOWLEDGEMENTS

E. R. B., F. H., B. K., and P. R. thank 1QBit for financial support. During part of this work, E. R. B. and F. H. were students at the Perimeter Institute and the University of Waterloo and received funding through Mitacs, and F. H. was supported through a Vanier Canada Graduate Scholarship. Research at the Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Colleges and Universities. E. R. B. was also supported with a scholarship through the Perimeter Scholars International program. P. R. acknowledges the financial support of Mike and Ophelia Lazaridis, and Innovation, Science and Economic Development Canada. J. C. acknowledges support from the Natural Sciences and Engineering Research Council, a Canada Research Chair, the Shared Hierarchical Academic Research Computing Network, Compute Canada, a Google Quantum Research Award, and the Canadian Institute for Advanced Research (CIFAR) AI chair program. Resources used by J. C. in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/#partners). P. R. thanks Christine Muschik for useful conversations. The authors thank Marko Bucyk for carefully reviewing and editing the manuscript.

AUTHOR CONTRIBUTIONS

E. R. B. and F. H. made equal contributions. They developed the codebase for all studies, performed numerical experiments, and analyzed the results. E. R. B. performed experiments using the IBM quantum processor. E. R. B. and F. H. focused on the quantum chemistry and the lattice Schwinger model case studies, respectively. All authors contributed to ideation and dissemination. J. C. and P. R. contributed to the theoretical foundations and design of the method.

DATA AVAILABILITY

The experimental and numerical quantum simulation measurement data used in Fig. 2 for H2 and LiH, as well as the measurement data used in Fig. 3a through Fig. 3d for the eight-site lattice Schwinger model, are available at https://github.com/1QB-Information-Technologies/NEM (see Zenodo repository [50]).

CODE AVAILABILITY STATEMENT

The numerical implementation of neural error mitigation as well as the code used to numerically implement the quantum simulations studied in the manuscript can be found at https://github.com/1QB-Information-Technologies/NEM (see Zenodo repository [50]).

COMPETING INTERESTS STATEMENT

The authors declare no competing interests.

[1] Feynman, R. P. Simulating physics with computers. Int. J. Theor. Phys. 21 (1982).
[2] Bennett, C. H. Logical reversibility of computation. IBM Journal of Research and Development 17, 525–532 (1973).
[3] Benioff, P. The computer as a physical system: A microscopic quantum mechanical Hamiltonian model of computers as represented by Turing machines. J. Stat. Phys. 22, 563–591 (1980).
[4] Manin, Y. Computable and Noncomputable (in Russian) (Sovetskoye Radio, Moscow, 1980).
[5] Preskill, J. Simulating quantum field theory with a quantum computer. arXiv:1811.10085 (2018).
[6] Cao, Y. et al. Quantum chemistry in the age of quantum computing. Chem. Rev. 119, 10856–10915 (2019).
[7] McArdle, S., Endo, S., Aspuru-Guzik, A., Benjamin, S. C. & Yuan, X. Quantum computational chemistry. Rev. Mod. Phys. 92, 015003 (2020).
[8] Bauer, B., Bravyi, S., Motta, M. & Kin-Lic Chan, G. Quantum algorithms for quantum chemistry and quantum materials science. Chem. Rev. 120, 12685–12717 (2020).
[9] Aspuru-Guzik, A., Dutoi, A. D., Love, P. J. & Head-Gordon, M. Simulated quantum computation of molecular energies. Science 309, 1704–1707 (2005).
[10] Bharti, K. et al. Noisy intermediate-scale quantum (NISQ) algorithms. arXiv:2101.08448 (2021).
[11] Cerezo, M. et al. Variational quantum algorithms. Nature Reviews Physics 3, 625–644 (2021).
[12] Peruzzo, A. et al. A variational eigenvalue solver on a photonic quantum processor. Nat. Commun. 5, 1–7 (2014).
[13] Endo, S., Cai, Z., Benjamin, S. C. & Yuan, X. Hybrid quantum-classical algorithms and quantum error mitigation. J. Phys. Soc. Jpn. 90, 032001 (2021).
[14] Roffe, J. Quantum error correction: An introductory guide. Contemp. Phys. 60, 226–245 (2019).
[15] Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: A review of recent progress. Reports on Progress in Physics 81, 074001 (2018).
[16] Carrasquilla, J. Machine learning for quantum matter. Advances in Physics: X 5, 1797528 (2020).
[17] Torlai, G. et al. Neural-network quantum state tomography. Nat. Phys. 14, 447–450 (2018).
[18] Becca, F. & Sorella, S. Quantum Monte Carlo Approaches for Correlated Systems (Cambridge University Press, 2017).
[19] Vaswani, A. et al. Attention is all you need. arXiv:1706.03762 (2017).
[20] Cai, Z. Multi-exponential error extrapolation and combining error mitigation techniques for NISQ applications. arXiv:2007.01265 (2020).
[21] Torlai, G. et al. Integrating neural networks with a quantum simulator for state reconstruction. Phys. Rev. Lett. 123, 230504 (2019).
[22] Song, C. et al. Quantum computation with universal error mitigation on a superconducting quantum processor. Sci. Adv. 5 (2019).
[23] Sun, J. et al. Mitigating realistic noise in practical noisy intermediate-scale quantum devices. arXiv:2001.04891 (2020).
[24] Torlai, G., Mazzola, G., Carleo, G. & Mezzacapo, A. Precise measurement of quantum observables with neural-network estimators. Phys. Rev. Research 2, 022060 (2020).
[25] Assaraf, R. & Caffarel, M. Zero-variance principle for Monte Carlo algorithms. Phys. Rev. Lett. 83, 4682 (1999).
[26] Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549, 242–246 (2017).
[27] Kokail, C. et al. Self-verifying variational quantum simulation of lattice models. Nature 569, 355–360 (2019).
[28] Klco, N. et al. Quantum-classical computation of Schwinger model dynamics using quantum computers. Phys. Rev. A 98, 032331 (2018).
[29] Martinez, E. A. et al. Real-time dynamics of lattice gauge theories with a few-qubit quantum computer. Nature 534, 516–519 (2016).
[30] Borzenkova, O. et al. Variational simulation of Schwinger’s Hamiltonian with polarisation qubits. arXiv:2009.09551 (2020).
[31] Brydges, T., Elben, A., Jurcevic, P., Vermersch, B., Maier, C., Lanyon, B. P., Zoller, P., Blatt, R. & Roos, C. F. Probing Rényi entanglement entropy via randomized measurements. Science 364, 260–263 (2019).
[32] Islam, R., Ma, R., Preiss, P. M., Eric Tai, M., Lukin, A., Rispoli, M. & Greiner, M. Measuring entanglement entropy in a quantum many-body system. Nature 528, 77–83 (2015).
[33] Huembeli, P. & Dauphin, A. Characterizing the loss landscape of variational quantum circuits. Quantum Sci. Technol. 6, 025011 (2021).
[34] Park, C.-Y. & Kastoryano, M. J. Geometry of learning neural quantum states. Phys. Rev. Research 2, 023232 (2020).
[35] Bukov, M., Schmitt, M. & Dupont, M. Learning the ground state of a non-stoquastic quantum Hamiltonian in a rugged neural-network landscape. arXiv:2011.11214 (2020).
[36] Hochreiter, S. & Schmidhuber, J. Long short-term mem-
ory. Neural Comput. 9, 1735–1780 (1997).
[37] Hibat-Allah, M., Ganahl, M., Hayward, L. E., Melko,
R. G. & Carrasquilla, J. Recurrent neural-network wave-
functions. Phys. Rev. Research 2, 023358 (2020).
[38] Sharir, O., Levine, Y., Wies, N., Carleo, G. & Shashua,
A. Deep autoregressive models for the efficient varia-
tional simulation of many-body quantum systems. Phys.
Rev. Lett. 124, 020503 (2020).
[39] Carrasquilla, J. et al. Probabilistic simulation of quan-
tum circuits with the Transformer. arXiv:1912.11052
(2019).
[40] Kingma, D. P. & Ba, J. Adam: A method for stochastic
optimization. arXiv:1412.6980 (2014).
[41] Carleo, G. & Troyer, M. Solving the quantum many-
body problem with artificial neural networks. Science
355, 602–606 (2017).
[42] Choo, K., Mezzacapo, A. & Carleo, G. Fermionic neural-
network states for ab-initio electronic structure. Nat.
Commun. 11, 1–7 (2020).
[43] Otis, L. & Neuscamman, E. Complementary first and
second derivative methods for ansatz optimization in
variational Monte Carlo. Phys. Chem. Chem. Phys. 21,
14491–14510 (2019).
[44] Bravyi, S. B. & Kitaev, A. Y. Fermionic quantum com-
putation. Annals of Physics 298, 210–226 (2002).
[45] Seeley, J. T., Richard, M. J. & Love, P. J. The Bravyi–
Kitaev transformation for quantum computation of elec-
tronic structure. J. Chem. Phys. 137, 224109 (2012).
[46] Bravyi, S., Gambetta, J. M., Mezzacapo, A. & Temme,
K. Tapering off qubits to simulate fermionic Hamiltoni-
ans. arXiv:1701.08213 (2017).
[47] Abraham, H. et al. Qiskit: An open-source framework
for quantum computing, 2019. https://qiskit.org (2019).
[48] Spall, J. C. Implementation of the simultaneous per-
turbation algorithm for stochastic optimization. IEEE
Trans. Aerosp. Electron. Syst. 34, 817–823 (1998).
[49] Krantz, P. et al. A quantum engineer’s guide to super-
conducting qubits. Appl. Phys. Rev. 6, 021318 (2019).
[50] Bennewitz, E. R., Hopfmueller, F., Kulchyt-
skyy, B., Carrasquilla, J. & Ronagh, P. 1QB-
Information-Technologies/NEM: 1QB-Information-
Technologies/Neural Error Mitigation (2022). URL
https://doi.org/10.5281/zenodo.6466405.

Supplementary Information:
Neural Error Mitigation of Near-Term Quantum Simulations

I. DETAILS OF THE TRANSFORMER NEURAL QUANTUM STATE

We use the Transformer [1] neural network architecture to represent neural quantum states. The Transformer can be applied to represent probability distributions and other functions over one-dimensional sequences of discrete data. At its core is self-attention, a mechanism that learns correlations by assigning attention weights to each pair of positions (i.e., how much the output at one position of a given sequence should depend on the input at another position). Each layer of the Transformer contains a self-attention component, which, in turn, consists of several independent so-called self-attention heads. Within a self-attention component, each of these heads has independent parameters, and acts on the layer inputs independently in parallel. The output of the self-attention component is the concatenation of the outputs of each head. In each head, for every pair of positions in the sequence, the attention weight is computed from the inputs at those positions, using key and query vectors. The output at each position is a sum, weighted by the attention weights, of the value vectors at all positions. To improve the representation power of the model, the self-attention components alternate with linear layers, which act on each position individually. To ensure the stability of the model, skip-connections and layer normalization are employed within each layer. A single Transformer layer with all of its components is depicted in Fig. S1.

[Figure S1: diagram of a single Transformer layer. From bottom to top: the layer input $e^{(k)}_{nd}$ added to the positional encoding $f_{nd}$; LayerNorm, self-attention, and ReLU, with a skip connection; LayerNorm, linear layer, and ReLU, with a skip connection; the layer output $e^{(k+1)}_{nd}$.]

FIG. S1: Illustration of a Transformer layer | The Transformer consists of several layers acting in sequence. The central components of a Transformer layer are the self-attention block and linear layer. To enable a position-aware output, at each layer a positional encoding of the index n is added to the input. To enhance stability, a skip connection and layer normalization are employed.

Recall from the Methods section of the main manuscript that the neural quantum state maps a bitstring $s \in \{0, 1\}^N$ to a tuple (p(s), ϕ(s)), with $\langle s|\psi\rangle = \sqrt{p(s)}\, e^{i\phi(s)}$, where N is the number of qubits. The computational cost of calculating ⟨s|ψ⟩ for a single computational basis state s is quadratic in the number of qubits, and alternative versions of the Transformer architecture improve this cost to being linear (e.g., [2]).
More specifically, the outputs p(s) and ϕ(s) are obtained from the bitstring s as follows. The bitstring s is extended to s̃ = (0, s) by prefixing a zero bit. Each bit $\tilde{s}_n$ of the input is encoded into a D-dimensional representation space using a learned embedding, yielding the encoded bit $e^{(0)}_{nd}$ with n ∈ {0, . . . , N} and d ∈ {1, . . . , D}. Each index n is encoded using a learned positional encoding, yielding the encoded index $f_{nd}$. The encoded input is processed using K layers of identical structure. Each layer consists of a masked multi-head self-attention mechanism [1], with H heads, followed by a linear layer, with skip-connections and layer normalization. The weights of the attention matrices and the linear layer are shared between the positions n, but not between the layers k.
The parameters of a single layer k are:
1. The query, key, and value matrices $Q^{(k)}_{hid}$, $K^{(k)}_{hid}$, and $V^{(k)}_{hid}$. The index h ∈ {1, . . . , H} labels the self-attention heads. The index i ∈ {1, . . . , D/H} runs over a representation space of dimension D/H, where we require that D/H be an integer. As before, we have d ∈ {1, . . . , D}.
2. A matrix to process the output of the self-attention heads, $O^{(k)}_{de}$, with d, e ∈ {1, . . . , D}.
3. A weight matrix and a bias vector of the linear layer, $W^{(k)}_{de}$ and $b^{(k)}_{d}$, with d, e ∈ {1, . . . , D}.
The self-attention component acts as follows:
1. We denote the input to the component by $i^{(k)}_{\mathrm{SA},nd}$, where n ∈ {0, . . . , N} and d ∈ {1, . . . , D}. Query, key, and value vectors are computed as $q^{(k)}_{nhi} = \sum_d Q^{(k)}_{hid}\, i^{(k)}_{\mathrm{SA},nd}$, $k^{(k)}_{nhi} = \sum_d K^{(k)}_{hid}\, i^{(k)}_{\mathrm{SA},nd}$, and $v^{(k)}_{nhi} = \sum_d V^{(k)}_{hid}\, i^{(k)}_{\mathrm{SA},nd}$.
2. Attention weights are computed as follows. Compute the attention scores $s^{(k)}_{nmh} = \sum_i q^{(k)}_{nhi}\, k^{(k)}_{mhi}$. Mask the attention scores by setting $s^{(k)}_{nmh} = -\infty$ whenever m > n. Compute the attention weights using the softmax function given by

\[
w^{(k)}_{nmh} = \frac{\exp\big(s^{(k)}_{nmh}\big)}{\sum_{m'} \exp\big(s^{(k)}_{nm'h}\big)}. \tag{S1}
\]

The masking ensures that $w^{(k)}_{nmh} = 0$ whenever m > n.
3. The output of each attention head is $a^{(k)}_{nhi} = \sum_m w^{(k)}_{nmh}\, v^{(k)}_{mhi}$. At each position n, concatenate the output of the attention heads; that is, reshape the indices h and i into one index d, giving $a^{(k)}_{nd}$.
4. The output of the self-attention component is $o^{(k)}_{\mathrm{SA},nd} = \sum_e O^{(k)}_{de}\, a^{(k)}_{ne}$. Due to the masking, the output at position n depends on the inputs only at positions m ≤ n. We write the action of the entire self-attention component more abstractly as $A^{(k)}$, that is, $o^{(k)}_{\mathrm{SA}} = A^{(k)}(i^{(k)}_{\mathrm{SA}})$.
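The four steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' PyTorch implementation: the function name, the array layouts, and the subtraction of the row maximum for numerical stability are assumptions, and the mask is written so that the output at position n depends only on inputs at positions m ≤ n, as stated in step 4.

```python
import numpy as np


def masked_self_attention(x, Q, K, V, O):
    """Masked multi-head self-attention.

    x: (N + 1, D) inputs; Q, K, V: (H, D // H, D) per-head matrices; O: (D, D).
    """
    # Step 1: query, key, and value vectors with indices (n, h, i).
    q = np.einsum("hid,nd->nhi", Q, x)
    k = np.einsum("hid,nd->nhi", K, x)
    v = np.einsum("hid,nd->nhi", V, x)

    # Step 2: attention scores s_{nmh}, masked so that position n never
    # attends to a later position m > n.
    scores = np.einsum("nhi,mhi->nmh", q, k)
    n_pos = x.shape[0]
    later = np.triu(np.ones((n_pos, n_pos), dtype=bool), k=1)  # True where m > n
    scores = np.where(later[:, :, None], -np.inf, scores)

    # Softmax over m; subtracting the row maximum leaves the result unchanged.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Step 3: weighted sum of values, then merge (h, i) into one index d.
    a = np.einsum("nmh,mhi->nhi", weights, v).reshape(n_pos, -1)

    # Step 4: output projection o_{nd} = sum_e O_{de} a_{ne}.
    return a @ O.T
```

A quick way to see the autoregressive property is to perturb the input at the last position and confirm that the outputs at all earlier positions are unchanged.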
The linear component acts on an input $i^{(k)}_{\mathrm{L},nd}$ as $o^{(k)}_{\mathrm{L},nd} = \sum_e W^{(k)}_{de}\, i^{(k)}_{\mathrm{L},ne} + b^{(k)}_{d}$, which we write more abstractly as $o^{(k)}_{\mathrm{L}} = L^{(k)}(i^{(k)}_{\mathrm{L}})$. Both components are wrapped with a skip-connection, layer normalization [3], and a ReLU activation function. The output of the wrapped self-attention component $A^{(k)}$, acting on an input $i^{(k)}_{nd}$, is

\[
a^{(k)} = G(A^{(k)}, i^{(k)}) = i^{(k)} + \mathrm{ReLU}\Big(A^{(k)}\big(\mathrm{LayerNorm}(i^{(k)})\big)\Big), \tag{S2}
\]

where ReLU(x) = max(x, 0) is the ReLU activation function acting component-wise, and layer normalization is applied on the last dimension of $i^{(k)}_{nd}$. We have introduced the notation G for the wrapping. The linear component $L^{(k)}$ is wrapped in the same manner, giving $e^{(k)} = G(L^{(k)}, a^{(k)})$. In sum, the action of the entire Transformer layer is as follows:
1. The input to the k-th layer is $i^{(k)}_{nd} = e^{(k-1)}_{nd} + f_{nd}$.

2. The wrapped self-attention component is applied to give $a^{(k)} = G(A^{(k)}, i^{(k)})$.

3. The wrapped linear layer is applied to give $e^{(k)} = G(L^{(k)}, a^{(k)})$.
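The wrapping G of Eq. (S2) and the three steps of a layer can be written compactly. The sketch below assumes a layer normalization without learned gain and bias and a small stabilizing constant `eps`; both are simplifications made for illustration, and the `attention` and `linear` arguments are placeholders for the components $A^{(k)}$ and $L^{(k)}$.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the last dimension (no learned gain or bias here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def wrap(component, x):
    """G(component, x) = x + ReLU(component(LayerNorm(x))), as in Eq. (S2)."""
    return x + np.maximum(component(layer_norm(x)), 0.0)


def transformer_layer(e_prev, f, attention, linear):
    """One layer: add the positional encoding, then apply both wrapped parts."""
    i = e_prev + f            # step 1: layer input
    a = wrap(attention, i)    # step 2: wrapped self-attention
    return wrap(linear, a)    # step 3: wrapped linear layer
```

Because the ReLU output is non-negative and is added to the skip path, each wrapped component can only increase its input element-wise, which is one simple property to check when testing an implementation.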

After the final Transformer layer, the outputs of the neural quantum state are obtained from the final representation $e^{(K)}_{nd}$ as follows. Scalar-valued logits $\ell_n$ are obtained from $e^{(K)}_{nd}$ using a linear layer, with weights shared among different positions n. The logits are used to obtain conditional probabilities according to

\[
\begin{aligned}
p(s_n = 1 \mid s_1, \dots, s_{n-1}) &= \sigma(\ell_{n-1}) \quad \text{and} \\
p(s_n = 0 \mid s_1, \dots, s_{n-1}) &= 1 - p(s_n = 1 \mid s_1, \dots, s_{n-1}) = \sigma(-\ell_{n-1}), \quad \text{where } n \in \{1, \dots, N\},
\end{aligned} \tag{S3}
\]

and $\sigma(\ell_{n-1}) = \frac{1}{1 + e^{-\ell_{n-1}}}$ is the logistic sigmoid function. The conditional probabilities give p(s) via the conditional probability chain rule

\[
p(s) = \prod_{n=1}^{N} p(s_n \mid s_1, \dots, s_{n-1}), \tag{S4}
\]

where p(s) is an automatically normalized probability distribution.


The factorization of the probabilities p(s), along with the fact that the neural network output at position n depends
only on the positions m ≤ n, may be leveraged to efficiently draw unbiased samples from the probability distribution
p. To do so, we proceed a single bit at a time. First, we compute p(s1 |s0 = 0), and sample the bit s1 from the
resulting probability distribution. We then compute p(s2 |s0 , s1 ) and sample the bit s2 from it, and so on, until all bits
have been sampled. The sampling is needed when training the NQS using VMC, as explained in the next section.
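The chain rule and the bit-by-bit sampling procedure can be illustrated with a toy conditional model. In the sketch below, `toy_logits` is a hypothetical stand-in for the Transformer: it maps the prefix (0, s_1, ..., s_{n−1}) to the logit used for bit s_n. Only the bookkeeping around it reflects the procedure described above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def toy_logits(prefix):
    """Hypothetical stand-in for the network output given (0, s_1, ..., s_{n-1})."""
    return 0.5 * sum(prefix) - 0.3 * len(prefix)


def probability(bits, logit_fn):
    """p(s) as the product of conditionals, following the chain rule."""
    p = 1.0
    prefix = [0]  # the prefixed zero bit
    for bit in bits:
        p_one = sigmoid(logit_fn(prefix))
        p *= p_one if bit == 1 else 1.0 - p_one
        prefix.append(bit)
    return p


def sample(n_bits, logit_fn, rng):
    """Draw one bitstring, one bit at a time, from the same distribution."""
    prefix = [0]
    for _ in range(n_bits):
        p_one = sigmoid(logit_fn(prefix))
        prefix.append(int(rng.random() < p_one))
    return tuple(prefix[1:])
```

Summing `probability` over all 2^N bitstrings gives one for any choice of `logit_fn`, which is the sense in which the distribution is automatically normalized.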
The phase ϕ(s) is obtained by forming a vector $E^{(K)} = (e^{(K)}_0, \dots, e^{(K)}_N)$ by concatenating the final representations, and projecting $E^{(K)}$ to a scalar value using a linear layer. Our Transformer is implemented in PyTorch [4], and is partially inspired by aspects of the implementations in Refs. [5, 6].

II. VMC TRAINING AND REGULARIZATION

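In the final step of NEM described in the Methods section in the main manuscript, the neural quantum state is trained using VMC. We add the regularizer

\[
L_{\mathrm{reg}} = -\epsilon_{\mathrm{reg}} \sum_{s \in \{0,1\}^N} \left| \langle s|\psi_{\vec{\lambda}}\rangle \right| \tag{S5}
\]

to the loss function, where $\epsilon_{\mathrm{reg}}$ is a coefficient that is decreased over the course of training. The regularizer maximizes the L1-norm of the wavefunction, thereby discouraging sparsity in the early stages of training. For ground states that are known to overlap greatly with one or a few computational basis states, this regularization technique encourages the NQS to capture subdominant amplitudes rather than collapsing onto the dominant contributions (e.g., the Hartree–Fock states, or the lattice Schwinger ground states at m → ±∞) early in the training process.
As with the energy, the regularizer and its gradient with respect to the parameters $\vec{\lambda}$ of the neural quantum state $|\psi_{\vec{\lambda}}\rangle$ need to be estimated from samples using the Monte Carlo method, giving the expressions

\[
\begin{aligned}
L_{\mathrm{reg}} &= -\epsilon_{\mathrm{reg}}\, \mathbb{E}_{s \sim p_{\vec{\lambda}}} \Big[ \left| \langle s|\psi_{\vec{\lambda}}\rangle \right|^{-1} \Big] \approx -\frac{\epsilon_{\mathrm{reg}}}{b} \sum_{i=1}^{b} \left| \langle s^i|\psi_{\vec{\lambda}}\rangle \right|^{-1} \quad \text{and} \\
\nabla_{\vec{\lambda}} L_{\mathrm{reg}} &= -\epsilon_{\mathrm{reg}}\, \mathbb{E}_{s \sim p_{\vec{\lambda}}} \Big[ \left| \langle s|\psi_{\vec{\lambda}}\rangle \right|^{-1} \nabla_{\vec{\lambda}} \Re\big(\ln \langle s|\psi_{\vec{\lambda}}\rangle\big) \Big] \approx -\frac{\epsilon_{\mathrm{reg}}}{b} \sum_{i=1}^{b} \Big[ \left| \langle s^i|\psi_{\vec{\lambda}}\rangle \right|^{-1} \nabla_{\vec{\lambda}} \Re\big(\ln \langle s^i|\psi_{\vec{\lambda}}\rangle\big) \Big].
\end{aligned} \tag{S6}
\]

Here, $\Re$ denotes the real part, the expectation values are taken over $p_{\vec{\lambda}}(s) = |\langle s|\psi_{\vec{\lambda}}\rangle|^2$, $s^i$ are samples from the same distribution, and b is the batch size used in VMC.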
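The sample-based estimate of the regularizer can be checked against the exact L1-norm on a small state vector. The sketch below stores the wavefunction as an explicit array, which is an assumption made only for illustration; with an autoregressive NQS the samples would come from the network itself. It relies on the identity Σ_s |⟨s|ψ⟩| = E_{s∼p}[|⟨s|ψ⟩|⁻¹], which holds for p(s) = |⟨s|ψ⟩|².

```python
import numpy as np


def l1_regularizer_exact(psi, eps_reg):
    """L_reg = -eps_reg * sum_s |<s|psi>|, evaluated by explicit summation."""
    return -eps_reg * np.sum(np.abs(psi))


def l1_regularizer_estimate(psi, eps_reg, batch_size, rng):
    """Monte Carlo estimate: sample s ~ |<s|psi>|^2 and average 1 / |<s|psi>|."""
    probs = np.abs(psi) ** 2
    probs = probs / probs.sum()  # guard against rounding in the normalization
    samples = rng.choice(len(psi), size=batch_size, p=probs)
    return -eps_reg * np.mean(1.0 / np.abs(psi[samples]))
```

Basis states with zero amplitude are never sampled, so the estimator does not divide by zero, although states with very small amplitudes contribute large, rarely drawn terms and therefore dominate the variance.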
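The regularizer’s gradient is added to the energy’s gradient at every iteration. For completeness, we reproduce the VMC algorithm here:
1. Draw b samples $\{s^1, \dots, s^b\}$ from $p_{\vec{\lambda}}(s) = |\langle s|\psi_{\vec{\lambda}}\rangle|^2$.

2. Compute the local Hamiltonians for all $s^i \in \{s^1, \dots, s^b\}$:

\[
H_{\mathrm{loc}}(s^i) = \sum_{\substack{t \in \{0,1\}^N \\ \langle s^i|\hat{H}|t\rangle \neq 0}} \frac{\langle s^i|\hat{H}|t\rangle \langle t|\psi_{\vec{\lambda}}\rangle}{\langle s^i|\psi_{\vec{\lambda}}\rangle}. \tag{S7}
\]

3. Estimate the energy and its gradient with respect to the parameters $\vec{\lambda}$ of $|\psi_{\vec{\lambda}}\rangle$ using

\[
\begin{aligned}
E &\approx \frac{1}{b} \sum_{i=1}^{b} \Re\big( H_{\mathrm{loc}}(s^i) \big) \quad \text{and} \\
\nabla_{\vec{\lambda}} E &\approx \frac{2}{b}\, \Re\Big[ \sum_{i=1}^{b} \big( H_{\mathrm{loc}}(s^i) - E \big)^{*}\, \nabla_{\vec{\lambda}} \ln \langle s^i|\psi_{\vec{\lambda}}\rangle \Big],
\end{aligned} \tag{S8}
\]

where $\ln \langle s|\psi_{\vec{\lambda}}\rangle = \ln \sqrt{p(s)} + i\phi(s)$ in terms of the neural network’s outputs.
4. Estimate the regularizer and its gradient.
5. Add the energy’s gradient and the regularizer’s gradient, and update the parameters using the Adam optimizer.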
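Steps 1 to 3 of the algorithm can be sketched for a small system in which the Hamiltonian is stored as a dense matrix and the wavefunction as an explicit vector. Both are assumptions made for illustration: in NEM the amplitudes come from the neural quantum state, and only the non-zero matrix elements of Ĥ are visited. The gradient and the Adam update (steps 3 to 5) are omitted because they require a differentiable model.

```python
import numpy as np


def local_energies(hamiltonian, psi, samples):
    """H_loc(s) = sum_t <s|H|t> <t|psi> / <s|psi> for each sampled index s."""
    return (hamiltonian[samples] @ psi) / psi[samples]


def vmc_energy(hamiltonian, psi, batch_size, rng):
    """Sample s ~ |<s|psi>|^2 and average the real part of the local energies."""
    probs = np.abs(psi) ** 2
    probs = probs / probs.sum()
    samples = rng.choice(len(psi), size=batch_size, p=probs)
    h_loc = local_energies(hamiltonian, psi, samples)
    return np.mean(h_loc.real), h_loc
```

When `psi` is an exact eigenvector of the Hamiltonian, every local energy equals the eigenvalue, so the estimate has zero variance; this is a convenient sanity check for any implementation of Eq. (S7).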

III. EXPERIMENTAL VQE RESULTS FOR LiH

We can analyze the quality of the variational procedure implemented on the five-qubit IBMQ-Rome device by looking at the energy optimization curves of the experimental results of VQE implemented for various bond lengths of LiH (see Fig. S2). Generally, the optimization curves show that the experimental VQE procedure performs as expected, with energy optimization curves decreasing and converging to low energy estimates. However, we also see that the simulation is unable to reach high-accuracy energy estimates, ∆E ≲ 10^−3. This is due to the effects of noise on, and limitations of, the variational quantum circuit [7]. Despite these errors, neural error mitigation is able to extend all experimental results up to chemical accuracy, with low infidelities. We refer the reader to Table S1 for the specific hyperparameters used in implementing VQE and NEM.

[Figure S2: ∆E (hartrees) versus the number of energy evaluations (0 to 500) for LiH bond lengths of 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.548, 1.8, 2.0, 3.0, and 4.0 Å. The end of the calibration stage and the chemical accuracy threshold are marked.]

FIG. S2: Experimental VQE energy optimization curves | Energy evaluations during SPSA optimization for LiH ground states at different bond lengths using a five-qubit superconducting quantum computer. In order to compare optimization curves for all bond lengths, we plot the energy difference to the ground state as a function of SPSA energy evaluations. Before optimization, the energy is estimated 25 times and used to calibrate the initial step size of the algorithm. These energy values are reported to the left of the vertical dashed line. During each SPSA iteration, the energy is estimated using 25 measurement bases with 1024 measurements per measurement basis. The parameters θ are updated 250 times, resulting in 500 energy evaluations. Note that three bond lengths show incomplete convergence by the end of the optimization process, with one bond length (1.8 Å) still decreasing. Although VQE does not reach convergence for all bond lengths, neural error mitigation is able to improve the results up to chemical accuracy, with low infidelity, as shown in the top row of Fig. 2 in the main manuscript.

IV. QUANTUM STATE RECONSTRUCTION

An important feature of NEM is the ability to analyze the quality of the reconstructed neural quantum state at
various stages of the NEM algorithm. Since NEM employs NQS at the core of its construction, we have direct access to
the reconstructed quantum state through the probability and phase distributions, p(s) = |⟨s|ψ⟩|^2 and φ(s) = arg(⟨s|ψ⟩). We analyze the reconstructed quantum state obtained after NQST as well as the final NEM
procedure for both of the systems studied in this paper, and compare the probability distributions modelled by the
neural quantum states to the VQE result. The VQE state is given by a density matrix ρ, because VQE is numerically
simulated using a density matrix simulator. The probabilities of ρ are given by p(s) = Tr(ρ |s⟩⟨s|). Since
we simulate VQE with noise, the VQE density matrix ρ describes a mixed state instead of a pure state. For mixed
states, the phase is not well-defined, and therefore not reported in Fig. S3 and Fig. S4.
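The quantities compared in this section are simple to extract once a state is available as an array. The sketch below assumes small systems stored as dense NumPy arrays; the global depolarizing map is used only to produce an example mixed state and is not the per-qubit noise model of the simulations.

```python
import numpy as np


def pure_state_distributions(psi):
    """Return p(s) = |<s|psi>|^2 and the phase arg(<s|psi>) for a state vector."""
    return np.abs(psi) ** 2, np.angle(psi)


def mixed_state_probabilities(rho):
    """p(s) = Tr(rho |s><s|), i.e. the diagonal of the density matrix."""
    return np.real(np.diag(rho))


def depolarize(rho, lam):
    """Global depolarizing map, used here only to build an example mixed state."""
    dim = rho.shape[0]
    return (1.0 - lam) * rho + lam * np.eye(dim) / dim
```

For a mixed state only the diagonal of ρ enters the comparison, which is why the probabilities remain available even though no per-basis-state phase can be assigned.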
In Fig. S3, we show the quantum states estimated at each stage of our process for the LiH ground states at a bond
length of l = 1.4 Å, prepared both experimentally (Fig. S3a) and numerically (Fig. S3b). We see that, for numerical
results, where we have access to the VQE quantum state, neural quantum state tomography accurately reconstructs
the optimized VQE state using the chosen measurement bases. For the experimental results, where we do not have
access to the final VQE quantum state, the neural quantum state trained using neural quantum state tomography
acts as an estimator of the final state expressed by the quantum device. At this stage, the neural quantum state
trained using NQST has not captured the exact ground state’s phase structure and inaccurately represents some of
the non-dominant computational basis states. After the NEM algorithm has been completed, the final NEM state
achieves accurate representations of both the probability distribution and phase for the computational basis states
whose exact probabilities are greater than 10−5 (or greater than 10−4 for experimental results). In the process of
improving the energy estimation achieved by VQE, NEM reconstructs and improves the ground-state wavefunction
itself. From another perspective, the classical ansatz trained through this process extends the “lifetime” of the
quantum simulation [8], allowing for its use in future work, without having to repeat the experiment.
Figure S4 shows a representation of the VQE state, as well as the neural quantum state after applying NQST
and after having completed NEM for the lattice Schwinger model. The phase structures of the lattice Schwinger
5

a b
100 100
Probability Exact NQST Exact NQST
NEM

Probability
NEM VQE
10−2 10−2

10−4 10−4

10−6 10−6
π π
Phase Error

Phase Error
π/2 π/2
0 0

−π/2 −π/2
−π −π
0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14
Computational Basis State – sorted Computational Basis State – sorted

FIG. S3: Quantum states at each step in the neural error mitigation algorithm for LiH | The quantum states obtained by VQE (blue
triangles), neural quantum state tomography (green circles), and neural error mitigation (red diamonds) for the LiH ground state at a bond
length of l = 1.4 Å. In contrast to the neural quantum state, the VQE state is given by a density matrix representing a mixed state because VQE
is simulated using a noisy density matrix simulator. Whereas the probability distribution is given by p(s) = Tr(ρ |si hs|), the phase for a mixed
state is not well-defined, and therefore not reported. In each subfigure, the top panel shows the probabilities for the NQS given by p(s) = |hs|ψi|2
for each computational basis state |si. Computational basis states are sorted according to these probability amplitudes. The bottom panel shows
the phase error of the NQS relative to the ground state given by arg(hs|ψi) − arg(hs|ψg i), where the global phase is fixed to a phase error of zero
for the computational basis state that has the largest ground state probability. Neural error mitigation applied to an experimentally prepared
VQE result is shown on the left (a), and NEM applied to a numerically simulated VQE result is shown on the right (b). Under our qubit
encoding for the electronic structure problem, the Hartree–Fock state is mapped to a computational basis state and constitutes the dominant
contribution to the exact ground state corresponding to the 0-th state on the horizontal axis. This results in an exact probability distribution
that has a sharp peak, as shown traced by the black line. We see that the neural quantum states trained using NQST approximately reconstruct
the VQE quantum state’s probability distribution, for the numerically simulated results. In addition, the final probability distribution
represented by the NEM states very accurately represents the basis states with exact probabilities greater than $10^{-5}$ for numerical simulations
and greater than $10^{-4}$ for our experimental results. The phase errors are shown in the bottom panel, where we see that the final NEM quantum
state accurately reconstructs the phase for computational basis states with probabilities greater than $10^{-5}$ for numerical simulations and greater
than $10^{-4}$ for our experimental results.

ground states follow a sign rule and have real amplitudes. Although possible, we do not enforce the sign rule or any
symmetries in our neural quantum states in order to show the general applicability of NEM. While the fidelity of the
NQST state with respect to the ground state is only 0.71, the errors in the complex phases that correspond to the
computational basis states with non-zero amplitudes are relatively small, and mostly confined to the range $[-\pi/2, \pi/2]$.
Given that converging to an accurate phase structure is one of the main difficulties encountered in training a neural
network using VMC [9], the NQST state may provide a good starting point from which it could be easy for the
VMC algorithm to converge to a good approximation of the ground state. After the NEM algorithm has completed,
the final NEM state achieves an accurate representation of both the probability distribution and the phases of the
lattice Schwinger ground state, specifically for the computational basis states whose exact probabilities are greater
than $10^{-5}$.
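The quantities plotted in Figs. S3 and S4 can be sketched in a few lines of NumPy. The snippet below is only an illustration of the definitions given in the captions, not the code used for the paper: the function names and the two-qubit toy state are hypothetical, and the states are assumed to be normalized state vectors.

```python
import numpy as np

def phase_error(psi, psi_g):
    """Per-basis-state phase error arg(<s|psi>) - arg(<s|psi_g>), wrapped to (-pi, pi].

    The global phase is fixed so that the error is zero for the basis state
    with the largest ground-state probability.
    """
    psi = np.asarray(psi, dtype=complex)
    psi_g = np.asarray(psi_g, dtype=complex)
    ref = np.argmax(np.abs(psi_g) ** 2)       # dominant ground-state basis state
    err = np.angle(psi) - np.angle(psi_g)
    err = err - err[ref]                      # fix the global phase
    return np.angle(np.exp(1j * err))         # wrap into (-pi, pi]

def fidelity(psi, psi_g):
    """Pure-state fidelity |<psi_g|psi>|^2 for normalized state vectors."""
    return np.abs(np.vdot(psi_g, psi)) ** 2

# Toy two-qubit example (hypothetical numbers, not data from the paper).
psi_g = np.array([0.9, 0.3, -0.3, 0.1], dtype=complex)
psi_g /= np.linalg.norm(psi_g)
# Same state up to a global phase, plus an extra phase of 0.2 on basis state 1.
psi = np.exp(0.7j) * psi_g * np.exp(1j * np.array([0.0, 0.2, 0.0, 0.0]))

order = np.argsort(-np.abs(psi_g) ** 2)       # sort basis states by exact probability
probs = np.abs(psi) ** 2                      # p(s) = |<s|psi>|^2
errors = phase_error(psi, psi_g)
print(probs[order], errors[order], fidelity(psi, psi_g))
```

Because the global phase is removed before wrapping, only the relative phase of 0.2 on the second basis state survives as a phase error in this toy case.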

V. COMPARISON OF NEM TO VMC

The key observation outlined in this paper is that NQST and VMC can be combined to form a post-processing
error-mitigation strategy for ground-state preparation when the two procedures are conducted using a common neural
network ansatz. In addition to analyzing how well NEM improves the results obtained from noisy quantum simulations,
it is also useful to compare NEM to its classical counterpart: training a neural quantum state using only the variational
Monte Carlo algorithm, hereafter referred to as standalone VMC. We compare the performance of both methods as
a function of VMC batch size, which is the number of samples used in estimating the gradient and updating the
neural network’s parameters during VMC (see Section II). Given that we fix the total number of iterations used in
training, the batch size is indicative of the classical computational resources required. Note that increasing the batch
size decreases the variance of the energy’s gradient estimate. For larger systems, a large batch size is often required in
order to reach chemical accuracy, which imposes a bottleneck on the possible applications of VMC [10]. Another way
to increase the amount of classical computational resources, which may exhibit different scaling, would be to increase
the number of iterations at a fixed batch size.
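The statement that a larger batch size reduces the variance of the gradient estimate can be illustrated with a toy Monte Carlo experiment. This is a minimal sketch under strong assumptions: per-sample contributions are modelled as independent unit-variance normal draws, which stands in for (and is much simpler than) the actual VMC gradient estimator; the function name is hypothetical.

```python
import numpy as np

def estimator_variance(batch_size, n_repeats=2000, seed=0):
    """Empirical variance of a batch-mean estimator over repeated batches.

    Each per-sample value is drawn from a unit-variance normal distribution,
    an assumed stand-in for the per-sample gradient contributions in VMC.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=0.0, scale=1.0, size=(n_repeats, batch_size))
    return samples.mean(axis=1).var()

# Batch sizes 2^4 ... 2^8, the range shown on the horizontal axis of Fig. S5.
for exponent in range(4, 9):
    b = 2 ** exponent
    print(b, estimator_variance(b))
```

For independent samples the variance of the batch mean falls as 1/b, so doubling the batch size halves the variance at twice the sampling cost per iteration.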
In Fig. S5 we compare the energy error and infidelity of NEM performed on the experimental VQE result and
standalone VMC, for the LiH molecule at a bond length of l = 1.4 Å (the same experimental VQE data is presented
in the top row of Fig. 2 in the main manuscript). For NEM, we fix the outcome of the first stage of NEM (i.e., the
neural quantum state trained via NQST) and then train the VMC component of NEM using different batch sizes to

[Figure S4. Top panel: probability (log scale, $10^{0}$ to $10^{-6}$) for the Exact, VQE, NQST, and NEM states. Bottom panel: phase error. Horizontal axis: computational basis state, sorted (0 to 250).]

FIG. S4: Quantum states at each step of the neural error mitigation algorithm for the N = 8 lattice Schwinger model | The
quantum states obtained by VQE (blue triangles), by NQST (green circles), and after completion of NEM (red diamonds) are shown, for the
lattice Schwinger model at m = −0.7. Similar to Fig. S3, the top panel shows the probabilities of each basis state sorted according to their
probability amplitudes, and the bottom panel shows the phase errors with respect to the exact ground-state phase. In contrast to the NQS
results, the VQE state is given by a mixed-state density matrix. Therefore, we report the VQE probability distribution but not the phase. Due to
the symmetries present in the lattice Schwinger ground state, many computational basis state amplitudes are zero in the exact ground-state
probability distribution (solid black line). While the probabilities of the VQE state somewhat follow the ground state, the VQE state
overestimates many of the computational basis states’ probabilities that do not contribute to the exact ground state. This is explained by the
effects of noise and because the quantum circuit employed only approximately preserves the symmetries of the model. We can see that NQST
successfully reconstructs the VQE state, accurately modelling the probability distribution for the computational basis states that contribute to
the exact ground state. Additionally, NQST provides an estimate of the phase, showing a phase error confined mostly to the range [−π/2, π/2]
for the computational basis states that make non-zero contributions to the exact ground state. At the end of NEM, the final probability
distribution represented by the NEM state accurately represents the probabilities and phases of the computational basis states that have exact
probabilities greater than $10^{-5}$.

investigate how the energy error and infidelity of the final NEM state scale. The results show that NEM performed
using the experimental results provides an advantage over using only VMC: NEM achieves chemical accuracy using
a smaller batch size than VMC. In addition, we note that NEM requires a smaller regularizer than training
using only VMC. Note that the ground states of LiH, a molecular system that can be mapped to four
qubits, can be feasibly solved using current classical methods.
For the lattice Schwinger model, we also study the performance of NEM and standalone VMC as a function of
system size and batch size (Fig. S6). We show that while the best results for standalone VMC are comparable to the
best results of NEM, standalone VMC has a lower success rate, especially for larger systems and smaller batch sizes.
Conversely, NEM reliably converges to a good approximation of the ground state. We speculate that the state found
by VQE, which is approximated by NQST, provides a good initialization for training using VMC, which would explain the
improved convergence rate of the NEM algorithm. While we expect that, for the system sizes presented, hyperparameter
tuning could potentially improve the results of standalone VMC, we speculate that the increased stability of NEM
over standalone VMC will persist at larger system sizes.
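Figures S5 and S6 summarize repeated training runs by their median and interquartile range. The sketch below shows one way such a summary could be computed and why it is robust to an occasional run that fails to converge. The helper name and the five infidelity values are hypothetical and purely illustrative; they are not results from the paper.

```python
import numpy as np

def median_and_iqr(values):
    """Return (median, 25th percentile, 75th percentile) of a 1-D collection."""
    q25, q50, q75 = np.percentile(np.asarray(values, dtype=float), [25, 50, 75])
    return q50, q25, q75

# Hypothetical final infidelities from five runs; one run failed to converge.
runs = [1e-4, 2e-4, 3e-4, 5e-1, 4e-4]
med, lo, hi = median_and_iqr(runs)
print("median:", med, "IQR:", (lo, hi), "mean:", np.mean(runs))
```

In this toy case the single failed run dominates the mean but leaves the median and quartiles unchanged, which is why a method with a lower success rate can still show a comparable best-case result.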
In order to understand the advantages of using quantum resources in conjunction with classical methods, future
research must be conducted to explore whether using the NEM algorithm for preparing ground states, such as those
for larger systems in quantum chemistry and lattice theories, converges to classical representations of quantum states
that are outside the reach of standalone VMC. This speculation stems from the fact that both NEM and standalone
VMC train a neural quantum state using the VMC algorithm, with the difference being that NEM initializes the
VMC algorithm using a classical representation of an experimentally prepared quantum state. One approach could
be to determine whether the NEM algorithm captures features of the experimentally prepared quantum state, such
as superposition and entanglement, in its initial neural quantum state representation and whether it retains these
features throughout the classical training process. We also speculate that exploring the loss landscape of VMC
can help to delineate the boundary between classically solvable ground-state preparation problems and those that
require quantum resources. In other words, we ask whether initializing VMC using a classical representation of
an experimentally prepared quantum state relaxes classical resource requirements of the VMC algorithm, such as
the exponentially large amount of memory needed to describe high-fidelity ground-state representations. The work
presented in this paper provides a framework for investigating the fundamental differences and potential synergy
between quantum and classical information processing.

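[Figure S5, panels (a) and (b). (a) Energy error $|\Delta E|$ (hartrees, log scale) with reference lines for chemical accuracy (CA) and Hartree–Fock (HF); (b) infidelity (log scale). Curves: VQE, NQST, NEM, VMC. Horizontal axis: VMC batch size ($2^{4}$ to $2^{8}$).]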

FIG. S5: Comparison between neural error mitigation and variational Monte Carlo for quantum chemistry | The performance of
neural error mitigation (red diamonds) on experimental VQE results compared to the performance of training a neural quantum state using
standalone VMC (orange crosses) is shown. Both methods are compared as a function of the VMC batch size for the LiH ground state at l = 1.4 Å.
Both methods are performed using a neural quantum state that has the same architecture and hyperparameter values. The median results for
VMC are shown for 10 runs, and the shaded regions show the interquartile range. In (a), the chemical accuracy (CA) and Hartree–Fock (HF)
energy error are reported. For LiH, NEM applied to the experimental results outperforms standalone VMC, achieving chemical accuracy at a
batch size a factor of two smaller than that required by standalone VMC.

[Figure S6. Top row: energy error per site $\Delta E/N$ (log scale, $10^{0}$ to $10^{-5}$). Bottom row: infidelity (log scale, $10^{0}$ to $10^{-5}$). Four columns for batch sizes $b = 2^{8}$, $2^{10}$, $2^{12}$, and $2^{14}$, each showing VQE, NQST, NEM, and VMC. Horizontal axis: system size (6 to 16).]

FIG. S6: Comparison between neural error mitigation and variational Monte Carlo for the lattice Schwinger model | Performance
of standalone VMC on the lattice Schwinger model, compared to the performance of NEM, for various system sizes N and batch sizes b. While
the best results achieved by standalone VMC are no worse than the best results achieved by NEM, standalone VMC does not consistently
converge to a high-quality approximation of the ground state. Neural error mitigation is more stable, especially at smaller batch sizes and larger
system sizes. Each panel shows the median results, and the shaded regions show the interquartile ranges. The hyperparameter values used for
standalone VMC are the same as those used in the scaling analysis shown in Fig. 3e and Fig. 3f of the main manuscript, and are listed in the
column “> 8 sites” of Table S1.

VI. COMPARISON OF NEM ACROSS DIFFERENT LEVELS OF NOISE IN VQE

The results presented in this paper have shown that NEM can improve estimates of ground states and ground state
properties obtained from VQE on noisy devices and in noisy simulations by supplementing these noisy VQE results
with classical simulation methods. However, for high levels of noise, we expect the results obtained from VQE to
contain less information about the true ground state. Beyond a certain level of noise, we expect that supplementing
VQE with NEM should not outperform an NQS trained using purely classical methods such as VMC. In order to
determine the regime where the combination of VQE and NEM holds the promise of improvement over purely classical
methods, we study the performance of NEM as a function of VQE noise levels.
We consider the eight-site Schwinger model, at the critical point m = −0.7, and compare the performance of NEM
for VQE noise rates ranging from zero noise to complete depolarization. The VQE circuit used is the same as that used for the
results shown in Fig. 4a of the main manuscript, and has an initial state of $|01\cdots01\rangle$. After each layer of the global
entangling operation or single-qubit rotations, a depolarizing channel with a variable rate λ ∈ [0, 1] is applied to each

[Figure S7. Three panels: energy error $\Delta E$, infidelity, and purity $\mathrm{tr}(\hat{\rho}^2)$ (log scales; the purity axis extends down to $2^{-8}$). Curves: VQE, NQST, NEM. Horizontal axis: depolarizing error rate (0, then $10^{-3}$ to $10^{0}$).]

FIG. S7: Comparison of the performance of NEM across different levels of noise in VQE | Performance of neural error mitigation
applied to VQE results for the eight-site Schwinger model at the critical mass m = −0.7. Single-qubit depolarizing noise gates of various noise
rates λ are applied after every operation in the simulation of the VQE circuit. The NEM performance when λ = 1 can be attributed to classical
resources only, since the output of VQE corresponds to a state close to the maximally mixed state. At $\lambda \leq 10^{-2}$, corresponding to a purity of
$\mathrm{tr}(\hat{\rho}^2) \geq 0.82$, the results of NEM show a clear improvement over those of classical resources alone. Shown is the median over 100 runs. The
shaded region represents the interquartile range.

qubit. The hyperparameters for NQST and VMC are the same as in the main manuscript.
At a depolarizing error rate of λ = 1, the VQE results are highly mixed and thus contain no information about
the true ground state or its properties, as noise completely dominates the simulation. While even in the completely
depolarizing simulation, NEM yields an improvement over VQE, its performance can be attributed to that of VMC
which uses only classical computational resources. Thus, the results obtained when λ = 1 can be used as a benchmark
for the performance of standalone VMC. The results in Fig. S7 show that at noise rates of $\lambda = 10^{-2}$ or lower,
corresponding to a median purity $\mathrm{tr}(\hat{\rho}^2) \geq 0.82$ of the VQE density matrix, the energy error and infidelity of the
NEM state yield a clear improvement over the NEM results when λ = 1. These results underscore the fact that NEM
used in conjunction with VQE shows an improvement over a computation that uses only classical resources, provided that
the quality of the quantum resources meets a minimum threshold for the system studied.
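The relation between the depolarizing rate and the purity can be made concrete with a small density-matrix sketch. The code below applies a single-qubit depolarizing channel to every qubit of a product state and evaluates the purity. It is an illustration only: it assumes the parameterization in which the channel maps a qubit to $(1-\lambda)\rho + \lambda I/2$, it omits the entangling and rotation layers of the actual VQE circuit, and all function names are hypothetical.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]

def embed(op, qubit, n_qubits):
    """Single-qubit operator `op` acting on `qubit` of an n-qubit register."""
    factors = [op if q == qubit else I2 for q in range(n_qubits)]
    return reduce(np.kron, factors)

def depolarize_all(rho, lam, n_qubits):
    """Apply a single-qubit depolarizing channel of rate lam to each qubit in turn."""
    for qubit in range(n_qubits):
        ops = [embed(p, qubit, n_qubits) for p in PAULIS]
        # (1/4) * sum_P P rho P equals Tr_qubit(rho) tensored with I/2 on that qubit.
        mixed = sum(op @ rho @ op.conj().T for op in ops) / 4
        rho = (1 - lam) * rho + lam * mixed
    return rho

def purity(rho):
    """Purity tr(rho^2) of a density matrix."""
    return float(np.real(np.trace(rho @ rho)))

# Four-qubit analogue of the |01...01> initial state (system size chosen for speed).
n = 4
psi = np.zeros(2 ** n, dtype=complex)
psi[int("0101", 2)] = 1.0
rho0 = np.outer(psi, psi.conj())

for lam in (0.0, 1e-2, 1e-1, 1.0):
    print(lam, purity(depolarize_all(rho0, lam, n)))
```

Under this convention a rate of one sends every qubit to the maximally mixed state, so the purity reaches its minimum of $2^{-n}$, which for eight qubits is the $2^{-8}$ floor of the purity axis in Fig. S7.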

VII. HYPERPARAMETER VALUES

We present the hyperparameter values of our numerical studies in Table S1.



| Hyperparameter | Lattice Schwinger (8 sites) | Lattice Schwinger (> 8 sites) | H2 | LiH |
|---|---|---|---|---|
| Variational Quantum Eigensolver | | | | |
| Iterations | 200 | 200 | 250 | 250 |
| Post-rotation circuits | 3 | 3 | 4 | 25 |
| Shots per basis | 512 | 1024 | 1024 | 1024 |
| SPSA parameters: | | | | |
| a0 | 0.1 | 0.1 | calibrated† | calibrated |
| c0 | 0.1 | 0.1 | 0.1 | 0.1 |
| α | 0.602 | 0.602 | 0.602 | 0.602 |
| γ | 0.101 | 0.101 | 0.101 | 0.101 |
| A | 10 | 20 | 0 | 0 |
| Neural Quantum State | | | | |
| Transformer parameters: | | | | |
| Layers | 2 | 2 | 2 | 2 |
| Heads | 4 | 4 | 4 | 4 |
| Internal dimension | 8 | 12 | 8 | 8 |
| Parameter count | 890 | 1766–2006 | 794 | 826 |
| Neural Quantum State Tomography | | | | |
| Bases | 15 | $2N - 1$ | 4 | 11 |
| Samples per basis | 512 | 512 | 300 | 500 |
| Batch size | 512 | 512 | 128 | 128 |
| Learning rate | $10^{-2}$ | $10^{-3}$ | $10^{-2}$ | $10^{-2}$ |
| Epochs | 50 | 30 | 100 | 100 |
| Time (Single CPU) | < 1 minute | 8 minutes (16 sites) | 30 seconds | 6 minutes |
| Variational Monte Carlo | | | | |
| Iterations | 400 | 3200 | 1000 | 1200 |
| Batch size | 512 | $2^{8}$ to $2^{14}$ | 256 | 256 |
| Learning rate | $10^{-2}$ | $3 \times 10^{-3}$, then decreased by 10 times after 1600 and 2400 iterations | $10^{-2}$ | $10^{-2}$ |
| Regularization | 0.1 for 200 iterations, then 0 | $25.6/2^{N}$ decreasing linearly for 1000 iterations, then 0 | 0.05 for 600 iterations, then 0 | 0.05 for 600 iterations, then 0 |
| Time (Single CPU) | < 1 minute | 6 hours (16 sites) | 10 seconds | 20 seconds |

TABLE S1: Hyperparameter values for neural error mitigation components | Presented are the hyperparameter values of the neural
quantum state, VQE training, neural quantum state tomography, and variational Monte Carlo for each system studied. † The a0 parameter for
each H2 and LiH variational circuit is calibrated [7] in Qiskit’s VQE function.
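To show how the SPSA parameters and the stepwise VMC learning rate in Table S1 might translate into per-iteration values, the sketch below uses the standard SPSA gain sequences $a_k = a_0/(k+1+A)^{\alpha}$ and $c_k = c_0/(k+1)^{\gamma}$. This is an assumption about the form of the schedule: the exact indexing convention of the Qiskit implementation used in the paper is not stated here, and both function names are hypothetical. Default arguments are the eight-site lattice Schwinger SPSA values and the "> 8 sites" VMC learning-rate values from the table.

```python
def spsa_gains(k, a0=0.1, c0=0.1, alpha=0.602, gamma=0.101, A=10):
    """Step size a_k and perturbation size c_k at iteration k (k = 0, 1, ...).

    Assumes the standard SPSA gain sequences; indexing from k = 0 is a choice
    made for this sketch.
    """
    a_k = a0 / (k + 1 + A) ** alpha
    c_k = c0 / (k + 1) ** gamma
    return a_k, c_k

def vmc_learning_rate(iteration, base=3e-3, milestones=(1600, 2400), factor=10.0):
    """Learning rate divided by `factor` once each milestone has been reached."""
    drops = sum(iteration >= m for m in milestones)
    return base / factor ** drops

for k in (0, 99, 199):
    print("SPSA", k, spsa_gains(k))
for it in (0, 1600, 2400, 3199):
    print("VMC", it, vmc_learning_rate(it))
```

Both sequences decay slowly with the exponents listed in the table, and the stability constant A damps the step size during the first iterations.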

[1] Vaswani, A. et al. Attention is all you need. arXiv:1706.03762 (2017).
[2] Wang, S. et al. Linformer: Self-attention with linear complexity. arXiv:2006.04768 (2020).
[3] Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. arXiv:1607.06450 (2016).
[4] Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703 (2019).
[5] Dai, Z. et al. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv:1901.02860 (2019).
[6] Parisotto, E. et al. Stabilizing Transformers for reinforcement learning. In International Conference on Machine Learning,
7487–7498 (PMLR, 2020).
[7] Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature
549, 242–246 (2017).
[8] Torlai, G. et al. Neural-network quantum state tomography. Nat. Phys. 14, 447–450 (2018).
[9] Szabó, A. & Castelnovo, C. Neural-network wavefunctions and the sign problem. Phys. Rev. Research 2, 033075 (2020).
[10] Choo, K., Mezzacapo, A. & Carleo, G. Fermionic neural-network states for ab-initio electronic structure. Nat. Commun.
11, 1–7 (2020).
