PRX QUANTUM 2, 010328 (2021)

Quantum Enhancements for Deep Reinforcement Learning in Large Spaces


Sofiene Jerbi,1,* Lea M. Trenkwalder,1 Hendrik Poulsen Nautrup,1 Hans J. Briegel,1,2 and Vedran Dunjko3

1 Institute for Theoretical Physics, University of Innsbruck, Technikerstr. 21a, Innsbruck A-6020, Austria
2 Department of Philosophy, University of Konstanz, Fach 17, Konstanz 78457, Germany
3 LIACS and LION, Leiden University, Niels Bohrweg 1, Leiden, CA 2333, Netherlands

(Received 6 October 2020; accepted 12 January 2021; published 22 February 2021)

Quantum algorithms have been successfully applied to provide computational speed ups to various
machine-learning tasks and methods. A notable exception to this has been deep reinforcement learning
(RL). Deep RL combines the power of deep neural networks with reinforcement learning, and has pro-
vided some of the most impressive recent artificial-intelligence (AI) results including the famous AlphaGo
system—yet, no possibilities for quantum advantages have been identified to date. In this work, we show
how quantum computers can enhance the performance of deep RL, especially where the action spaces
are large. Specifically, we introduce so-called deep energy-based models, inspired by statistical physics,
which we show outperform standard deep RL machinery in learning performance. These models are com-
putationally more demanding, but this can be ameliorated by quantum techniques. Specifically, we provide
quantum algorithms, some of which can be run on near-term quantum computers, that can be used to speed
up deep energy-based RL. This result opens up a new playing field for quantum enhancements in machine
learning and AI.

DOI: 10.1103/PRXQuantum.2.010328

I. INTRODUCTION

Over the course of the last years, machine learning (ML) has emerged not just as a promising domain of application for quantum computing [1–5], but is often listed as one of its potential "killer applications" [6]. In the contexts of classification and big data problems, genuine quantum effects have been exploited to speed up computational bottlenecks stemming from either genuinely hard computational tasks [7,8] or from data volume and dimensionality [9–11]. More recently, new types of machine-learning algorithms for classification and generative modeling have been put forward to maximize the potential of noisy intermediate-scale quantum (NISQ) [12] devices. In this flavor of quantum ML, both quantum generalizations of classical methods [13], and genuinely quantum models [14] have also been proposed.

Beyond being a fruitful domain of application for existing quantum algorithms, the varied machine-learning settings have also stimulated the developments of new quantum algorithms [8,15,16].

However, not all areas of machine learning have been equally amenable to quantum enhancements, with the area of deep reinforcement learning (RL) being a critical exception. Deep RL combines the capacities of deep neural networks (NNs) with the interactive learning framework of reinforcement learning (see Fig. 1), and it underlies some of the most famous results in modern AI, such as the AlphaGo [17,18] system. It is becoming the go-to method in the hardest problems, including automated driving [19] and drug design [20], but also in machine-assisted research [21,22], quantum control [23–25], quantum sensing [26], quantum-algorithm design [27], and quantum error correction [28–30]. Nonetheless, despite its obvious importance, near-celebrity status, and numerous challenges which are still to be tackled [31], deep RL to date has not been successfully quantum enhanced.

In this work, we demonstrate quantum enhancements for a class of deep reinforcement learning, relying on ideas inspired by statistical physics and by combining them with known and new quantum algorithms. In more detail, we specify classes of neural-network-based RL algorithms, which achieve better learning performances than conventional methods when the state and action spaces of the learning algorithm are large. This is the case in many applications, e.g., with continuous degrees of freedom.

*sofi[email protected]

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.




We achieve this by utilizing computationally demanding underlying learning models, which, as we show, can enjoy significant quantum enhancements, in terms of speed ups and in terms of natural quantum generalizations. Our models are based on the ideas introduced by LeCun in Ref. [32], and the quantum algorithms involve accelerated sampling from hard distributions and variational circuit ideas.

Contributions. Our main results are threefold.

1. We introduce deep energy reinforcement learning, and demonstrate, using numerical experiments, its superior learning performance over conventional methods in large-action-space environments. Specifically, it attains near-optimal results, whereas the standard methods using models with similar parameter numbers fail.
2. We provide a general quantum-enhanced learning algorithm for deep energy-based RL, which utilizes quantum-sampling subroutines, and other quantum-algorithmic methods to speed up the learning process. For these purposes, we also provide the currently most efficient method for the quantum Gibbs sampling problem, which has further applications in, e.g., training quantum Boltzmann machines [13,33].
3. We specify quantum generalizations of classical energy-based models, which can be trained using our algorithm, and which can have further advantages over conventional methods by their capacity to capture more complex action-state distributions.

Our setting shares similarities with approaches previously proposed in generative machine learning, which utilize quantum algorithms to speed up sampling [34,35], or represent hard distributions using quantum circuits [36–39]. It thus also has the potential of establishing provable quantum-sampling separations [40,41]. However, in RL, one is not simply interested in representing or sampling from a fixed target distribution. Instead, agents should be able to represent an entire sequence of distributions (i.e., policies). These should generate useful interactions with the environment and slowly evolve to an optimal behavior, as we explain in more detail shortly. These specificities of the RL scenario make it a richer setting, which calls for new application-specific algorithms and has potential quantum advantages not present in bare generative modeling tasks.

FIG. 1. Quantum-enhanced reinforcement learning. Agent and environment maintain a cyclic interaction consisting of states (or percepts) that the agent uses to deliberate on its actions according to its policy, and perceives feedback on its behavior in the form of rewards. In the quantum-enhanced framework, the agent is granted access to a, generally universal, quantum computer to enhance its deliberation process.

The paper is organized as follows: Sec. II serves as an introduction to the well-known problem of large state and action spaces in reinforcement learning and the (in)adequacy of different methods to deal with this issue. Section III presents a step-by-step construction of the deep energy-based networks we consider. We evaluate their performance on benchmarking tasks and compare them to standard NN architectures used in deep RL in Sec. IV. Section V introduces a hybrid quantum-classical algorithm to train energy-based models for RL and describes the quantum subroutines it relies on. Finally, in Sec. VI, we discuss our results and present an outlook.

In Secs. II and III, we take a detailed didactic approach to the construction of our principal energy-based model: the deep energy-based network (DEBN). In this process, we draw interesting connections between the standard feedforward NNs used in deep RL [deep Q networks (DQNs)] and a relatively unexplored energy-based approach relying on restricted Boltzmann machines. The purpose of these sections is to better situate our model with respect to the fields it relates to and to set up a comprehensive framework for deep energy-based RL that allows for a straightforward generalization to quantum models. In Sec. V, we provide such generalizations. We refer the reader who may already be familiar with energy-based models and (deep) RL directly to Sec. III B for a short introduction to our DEBNs, before proceeding to the main contributions of our paper in Secs. IV and V.

II. BACKGROUND AND PRELIMINARIES

In this section, we introduce the well-studied case of RL environments with large state and action spaces and discuss modern approaches [42,43] that attempt to deal with such scenarios by contrasting them with traditional RL methods [44,45].

A. Tabular reinforcement learning methods

RL is a framework for the design of learning agents in interactive environments (see Fig. 1). We refer to Ref. [44] for a description of the RL formalism, which we do not require in full detail here.


The goal of an agent in RL is to learn a policy π(a|s) that specifies its actions a given perceived states s such that this policy maximizes its future rewards in a given task environment. In this work, we focus on two particular methods for the design of RL agents.

(a) Methods deriving from dynamic programming focus on a so-called action-value function Q(s, a) to link policy and expected rewards. These value-based methods (VBMs) [44] include, for instance, Q-learning and SARSA.
(b) Projective simulation (PS) [45] is a physics-inspired model for agency [46] that relies on weighted connections in a directed graph to reflect experienced rewards and derive a policy. These weights are referred to as h values. In its simplest (i.e., two-layered) form, the underlying weighted graph contains only directed connections from states to actions, resulting in state-action h values h(s, a).

To unify the treatment of these two approaches, we define the notion of merit functions M, of which both auxiliary functions Q and h are instances. These are real-valued functions defined on the state-action space that assign a certain value of merit to all actions in any given state. A standard procedure to learn merit functions and their associated policies is the so-called tabular method: scalar values M̂(s, a) are stored for each state-action pair and are updated through interaction with the environment. This interaction is governed by the normalization of the merit values into a policy, e.g., by using a Gibbs-Boltzmann distribution, also called "softmax" distribution in ML:

\pi(a|s) = \frac{e^{\beta \hat{M}(s,a)}}{\sum_{a'} e^{\beta \hat{M}(s,a')}}, \qquad (1)

where β > 0 is an inverse temperature hyperparameter.
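As an illustration of Eq. (1), here is a minimal NumPy sketch (ours, not code from the paper or its repository [94]; the function names and the toy merit table are hypothetical) showing how a row of tabular merit values is normalized into a softmax policy and sampled from:

import numpy as np

def boltzmann_policy(merit_row, beta=1.0):
    """Normalize a row of merit values M(s, .) into a softmax policy, Eq. (1)."""
    logits = beta * np.asarray(merit_row, dtype=float)
    logits -= logits.max()                      # stabilize the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()

def sample_action(merit_table, state, beta=1.0, rng=np.random.default_rng()):
    """Draw an action index a ~ pi(.|s) for a tabular merit function."""
    probs = boltzmann_policy(merit_table[state], beta)
    return rng.choice(len(probs), p=probs)

# toy example: 3 states, 4 actions, randomly initialized merit table
rng = np.random.default_rng(0)
merit_table = rng.normal(size=(3, 4))
print(sample_action(merit_table, state=1, beta=2.0, rng=rng))

A larger β concentrates the policy on the currently best-rated actions, while a smaller β keeps exploration alive.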
VBMs and PS each define an update rule that allows the values M̂(s, a) associated with experienced state-action pairs to be iteratively updated such that rewarded actions become more likely to be taken at the next encounter with this state. For instance, in the case of Q-learning (a VBM):

\hat{Q}[s^{(t)}, a^{(t)}] \leftarrow (1-\alpha)\,\hat{Q}[s^{(t)}, a^{(t)}] + \alpha \{ r^{(t+1)} + \gamma \max_a \hat{Q}[s^{(t+1)}, a] \}, \qquad (2)

where a^{(t)} is the action performed by the agent in the state s^{(t)} at timestep t, which leads it to the state s^{(t+1)} and associated reward r^{(t+1)}; α is a learning rate taking value in [0, 1] and used to deal with statistical fluctuations.
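The tabular update of Eq. (2) can be sketched as follows (our own illustration; variable names are hypothetical):

import numpy as np

def q_learning_update(Q, s, a, r_next, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, Eq. (2): move Q[s, a] toward the bootstrapped target."""
    target = r_next + gamma * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

# toy example: 5 states, 3 actions
Q = np.zeros((5, 3))
Q = q_learning_update(Q, s=0, a=2, r_next=1.0, s_next=3)
print(Q[0])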
The fundamental problem with tabular approaches is their lack of generalization capabilities: there is no mechanism enabling an agent to generalize learned values to unobserved states and actions. Since merit-value updates act on single state-action pairs at a time, learning with tabular methods becomes intractable for environments with large (potentially continuous) state and action spaces. However, this is a rather common aspect of practical RL tasks. Large and continuous state spaces appear, for instance, in vision problems, where camera images constitute the state space. Large action spaces are common in many control problems of large processes, including quantum control and quantum experiments [21,23,24,29,30,47–54], where the continuous degrees of control are commonly discretized into actions to efficiently search for their optimal values. While generalization mechanisms do not overcome the intractable lower-bound complexities of learning in large spaces [55], practical RL environments allow certain generalization heuristics to learn much faster, as we explain shortly.

B. Quantum approaches to reinforcement learning

To date, quantum enhancements for RL have mostly been considered in the context of tabular methods or when the environments allow genuine quantum interaction. A review of recent advances can be found in Ref. [56].

In the cases where quantum interaction with the environments is possible, quantum enhancements can be proven in terms of the sample complexity (i.e., interaction steps) needed to learn a good policy. For instance, quadratic speed ups in sample complexity have been shown using methods derived from Grover search [57] and phase estimation [58] [55,59–61]. Proof-of-concept environments exhibiting exponential separations in sample complexity have also been showcased [62].

With respect to enhancements of the algorithmic part of the learning agent, speed ups have only been suggested for special models [63,64] and by building on the D-Wave system [65,66]. While the work of Ref. [63] provides a theoretical guarantee of quadratic speed up (as it relies on quantum-walk algorithms similar to the ones we use in Sec. V C), the works that rely on the D-Wave system do not provide any guarantee of speed up.

In our work, we construct a more general framework, centered around the generalization capacities of the learning model underlying reinforcement learning. For that, we focus on the most successful approach for generalization in RL, based on neural networks.

C. Generalization with neural networks

A common way to provide RL agents with generalization mechanisms is through function-approximation methods, closely related to regression problems in supervised learning. Such approaches introduce a parametrized representation M^θ(s, a) of the merit function, replacing an otherwise sparsely filled merit table by a function that is defined on the entire state-action space. Updates of the merit function act no longer on individual values M̂(s, a) but on a set of parameters θ, allowing one to indirectly update the values M^θ(s, a) of many—including unobserved—state-action pairs simultaneously. Using the terminology of supervised learning, {M^θ}_θ constitutes the set of all possible hypothesis functions and learning consists in updating the parameters θ so as to find the setting that best approximates the true merit function M. Due to their success in supervised and unsupervised learning, neural networks, parametrized by their weights, have gained popularity as function approximators in RL. As we clarify next, while generalization over the state space is shared by all function approximation methods based on neural networks, generalization over the action space, i.e., learning a useful similarity measure between actions, is less common.


FIG. 2. Sketch of comparison between the layouts of DQN-like networks [42] and the DEBN we consider. The DQN-like model (top) estimates the merit function for all actions simultaneously while the DEBN (bottom) estimates the merit function for a single state-action pair. The first hidden layers may be convolutional.

When using feedforward neural networks as function approximators, one can choose between two different approaches to represent the merit function (see Fig. 2). The first representation (i) takes as input a state s and outputs all values (M^θ(s, a_1), ..., M^θ(s, a_{|A|})) associated to that state; while the second (ii) takes as input a single state-action pair (s, a) and outputs its corresponding merit value M^θ(s, a). Each of these representations has its advantages in different contexts, that we clarify below.
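The two representations can be contrasted with a small sketch (ours, not the authors' implementation; dimensions and weight initializations are arbitrary): a type-(i) network maps a state to a vector of |A| merit values, while a type-(ii) network maps a single state-action pair to one scalar, so its input only needs to grow with the size of the action encoding.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
dim_s, num_actions, hidden = 8, 16, 32

# type (i), DQN-like: state in, one merit value per action out
W1, W2 = rng.normal(size=(hidden, dim_s)), rng.normal(size=(num_actions, hidden))
def merit_type_i(s):
    return W2 @ relu(W1 @ s)                    # shape (|A|,)

# type (ii), DEBN-like: state-action pair in, a single merit value out
dim_a = int(np.ceil(np.log2(num_actions)))      # actions can be encoded compactly
V1, V2 = rng.normal(size=(hidden, dim_s + dim_a)), rng.normal(size=hidden)
def merit_type_ii(s, a_bits):
    return float(V2 @ relu(V1 @ np.concatenate([s, a_bits])))   # scalar

s = rng.normal(size=dim_s)
a_bits = np.array([1.0, 0.0, 1.0, 1.0])          # binary encoding of one action
print(merit_type_i(s).shape, merit_type_ii(s, a_bits))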
In their seminal work [42], Mnih et al. introduce a so-called DQN of type (i) that quickly became the standard in value-based deep RL [67]. DQNs spiked interest notably due to their success in vision problems, e.g., playing Atari games, as their so-called convolutional hidden layers made them perfectly suited to deal with the high-dimensional space of the images they receive as input. Importantly for our later analyses, utilizing a network of type (i) makes it particularly efficient to evaluate the Boltzmann policy of the agent [Eq. (1)], since it requires only a single evaluation of the neural network followed by the application of a single additional softmax layer that normalizes all the merit values of its output.

However, this representation suffers from shortcomings when it comes to tasks with large action spaces, as we elaborate. During the first stages of learning, it is important for the agent to keep an explorative behavior to not get stuck in a possibly suboptimal policy. For this reason, the function approximator should be able to capture several of the local maxima of the merit function, which may originate from complex state-action correlations. In other words, the representation should be multimodal. However, it is reasonable to expect a neural network that can represent only merit values that are linear combinations of (nonlinear) state features to be unlikely to learn highly multimodal merit functions. This is why several works have chosen to avoid DQNs in environments with large action spaces [68,69].

We refer to representations of type (ii) as deep energy-based networks due to their interesting connection to so-called energy-based models, as explored in machine learning [32]. Inspired by statistical physics, energy-based models represent a target function using the energy function of a statistical system (that can be in one of combinatorially many configurations, each having a certain energy). When normalized, this energy function gives rise to a probability distribution over all possible configurations. An example of an energy-based model is the Boltzmann machine, introduced by Hinton and Sejnowski [70], which is inspired by the Ising model for spin systems [71]. Restricted Boltzmann machines (RBMs), i.e., Boltzmann machines with a further constrained set of energy functions, were the first energy-based models reportedly used in RL with the work of Sallans and Hinton [43], followed by various extensions [64,68,72–74]. Interestingly, although RBMs are commonly associated to a class of so-called stochastic recurrent neural networks, it turns out that they can be expressed as feedforward neural networks of type (ii) [75] with a single hidden layer. The DEBNs we consider here are then straightforwardly obtained by adding additional hidden layers to this feedforward neural network [76]. Energy-based models are also known to be good at representing multimodal distributions [68,77]. Hence, since DEBNs are able to encode deep and multimodal representations, they are well suited for tasks with both large state and action spaces.

A major drawback of DEBNs is the high cost of sampling from the policy (1): |A| evaluations of the network are needed to evaluate the policy exactly. Approximate sampling methods based on random walks, such as Gibbs sampling, can speed up the sampling process. However, Long et al. [78] showed that approximate sampling is still an NP-hard problem in the case of RBMs, an argument that can be trivially generalized to DEBNs.


In this work, we experimentally explore these intuitively motivated advantages of deep energy-based models in complex RL environments and show how quantum algorithms can help mitigate the computational bottleneck that they introduce.

III. DEEP ENERGY-BASED RL

In this section, we give a short introduction to energy-based models, which we build upon to design our deep energy-based learning agents. We first consider RBM function approximators for RL [43] and present an established connection between RBMs and shallow feedforward networks [75], which allows us to consider a natural extension to a deep architecture—our so-called DEBNs. We show that this function approximation method is compatible with both VBM and PS update rules. This construction thereby also provides an extension of PS to NN function approximation, which we call deep PS, in analogy to deep Q-learning.

A. Energy-based RL

1. Energy-based models

Energy-based models (EBMs) are a popular class of probabilistic models that find applications in both supervised and unsupervised learning [32]. They are often used as generative models, that is, to model the empirical probability distribution of observed data vectors {v}_{v∈D}, e.g., strings of binary variables. EBMs assign a scalar energy value to each configuration of these variables, e.g., all possible binary strings of length |v|, whose structure should reflect dependencies (i.e., correlations) between the variables. This energy function is usually parametrized by a vector θ that determines the strengths of these dependencies. In the case of a Boltzmann machine, for instance, the energy function takes the form:

-E^{\theta}(v) = \sum_{i<j} w_{ij} v_i v_j + \sum_i b_i v_i, \qquad (3)

where θ = ((w_ij)_{1≤i<j≤|v|}, (b_i)_{1≤i≤|v|}). Note that this particular Boltzmann machine is said to be fully visible, since its energy depends only on the visible variables v_i that encode the data (further explained shortly). The topology of the model specifies which variables are assumed (in)dependent. Combined, topology and parametrization constrain the family of available energy functions {E^θ}_θ that an EBM can represent.

A probability distribution over the configuration space can be derived from the energy function by a certain normalization over all possible configurations (example below). Normalization assigns a high probability to configurations with low energy and vice versa. This generative probability distribution can be trained, i.e., iteratively modified, to best fit an (empirical) probability distribution through updates of the parameters θ. These updates effectively select new energy functions from {E^θ}_θ, and, equivalently, new generative distributions.

In order to capture more complex dependencies between the variables, an additional set of auxiliary variables h, called hidden or latent variables, can be added to the set of so-called visible variables v. The model has now an energy E^θ(v, h) that includes terms characterized by the parameters between visible and hidden variables. The probability distribution over the visible variables is then specified by a function F^θ(v) called the free energy of the model due to its connection to the equilibrium free energy in statistical physics. It is obtained by tracing out (or marginalizing) the hidden variables from the normalizing distribution. For instance, in the case of a softmax normalization, we have

P^{\theta}(v) = \frac{\sum_h e^{-E^{\theta}(v,h)}}{\sum_{v',h'} e^{-E^{\theta}(v',h')}} = \frac{e^{-F^{\theta}(v)}}{\sum_{v'} e^{-F^{\theta}(v')}} \qquad (4)

and

F^{\theta}(v) = E^{\theta}\left[v, \langle h \rangle_{P(h|v)}\right] - H(h|v), \qquad (5)

where ⟨h⟩_{P(h|v)} is the expectation value of the hidden variables under P(h|v) [defined similarly to distribution (4)] and H(h|v) is their Shannon conditional entropy under this same probability distribution. A derivation of Eq. (5) is provided in Appendix B.

2. Energy-based function approximation

Following the idea of Sallans and Hinton [43], we use the free energies of energy-based models as parametrized approximations of merit functions (such as the Q function in VBMs or the h values of PS), that is, we take M^θ to be defined by the model's −F^θ. We start with a simple energy-based model: a restricted Boltzmann machine, further described in Appendix B. To represent a policy using a RBM, we divide its visible units into state and action units (see Fig. 3). This allows one to define a conditional probability distribution:

\pi^{\theta}(a|s) = \frac{e^{-F^{\theta}(s,a)}}{\sum_{a'} e^{-F^{\theta}(s,a')}} = \frac{e^{M^{\theta}(s,a)}}{\sum_{a'} e^{M^{\theta}(s,a')}}, \qquad (6)

effectively encoding the policy of an agent as prescribed by Eq. (1) (the inverse temperature β can usually be absorbed by the parameters θ).

This approximation of the merit function by the RBM free energy, as opposed to a tabular approach, has the advantage of being defined on the entire state-action space, which allows the agent to act nonrandomly on previously unseen states. Moreover, (implicitly) stored values M^θ(s, a) are not updated independently from one another, as updates on θ act on many values at a time, which allows one to generalize learned merit values to unobserved states and actions.


FIG. 3. Restricted Boltzmann machine used for reinforcement learning. The visible units are divided between state and action units. Their connections to hidden units are weighted by a weight matrix W. Biases are not represented for compactness.

At first sight, this form of energy-based function approximation may look completely unrelated to an approach relying on feedforward neural networks, such as the DQN of Mnih et al., making their structure hard to compare. Indeed, the expression of the RBM free energy given in Eq. (5) does not seem to be expressible as a neural-network function, i.e., linear combinations of (input) variables interlaid by nonlinear activation functions. But a quick derivation, given in Appendix B, shows that the free energy of a RBM can be evaluated exactly by a shallow feedforward neural network taking the form of Fig. 4. We come back to this interesting connection in the next subsection when we consider deep neural networks.

FIG. 4. Feedforward neural network computing the exact free energy F of a RBM. W and b are the vectors of weights and biases parametrizing the RBM. The activations of the hidden units are softplus functions (see Appendix B). These activations and the input bias are summed up by the output unit.
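To make the connection concrete, here is a minimal sketch (ours, not the derivation of Appendix B) of the free energy of an RBM with binary hidden units evaluated as the shallow softplus network of Fig. 4, assuming the standard RBM energy E(v, h) = −b·v − c·h − vᵀWh (the names W, b, c are ours), together with a brute-force check against the definition F(v) = −ln Σ_h e^{−E(v,h)} for a few hidden units:

import numpy as np
from itertools import product

def free_energy_network(v, W, b, c):
    """F(v) = -b.v - sum_k softplus(W[:, k].v + c_k): the network of Fig. 4."""
    x = v @ W + c
    return -(v @ b) - np.sum(np.logaddexp(0.0, x))    # log(1 + e^x) = softplus

def free_energy_bruteforce(v, W, b, c):
    """F(v) = -ln sum_h exp(-E(v, h)) with h ranging over {0, 1}^K."""
    K = W.shape[1]
    energies = [-(v @ b) - (h @ c) - v @ W @ h
                for h in (np.array(bits) for bits in product([0, 1], repeat=K))]
    return -np.log(np.sum(np.exp(-np.array(energies))))

rng = np.random.default_rng(2)
n_visible, n_hidden = 6, 4
W, b, c = rng.normal(size=(n_visible, n_hidden)), rng.normal(size=n_visible), rng.normal(size=n_hidden)
v = rng.integers(0, 2, size=n_visible).astype(float)
print(np.isclose(free_energy_network(v, W, b, c), free_energy_bruteforce(v, W, b, c)))

The agreement of the two functions is exactly why a DEBN can output merit values with a single feedforward pass, without ever summing over hidden configurations.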
3. Updating the merit function

Updates on a RBM are typically performed through gradient descent applied on the weights θ parametrizing its (free) energy. When used in an unsupervised learning scenario, the descent direction is given by the gradient of the Kullback-Leibler divergence (also known as relative entropy) between the RBM distribution P^θ(v) and the empirical data distribution p̂(v) [79,80]. In a reinforcement learning scenario, however, we do not have direct access to (samples from) the optimal policy. Instead, the agent receives feedback from the environment in the form of delayed rewards. That is, feedback is given only after long sequences of actions, preventing the direct evaluation of any measure of distance to the optimal policy. For this reason, the update rule of the RBM used in RL is derived from the update rule of the merit function that the model approximates. To explain what we mean, we detail how this is done in the case of Q-learning, which has a tabular update rule [44] given by

-F^{\theta}[s^{(t)}, a^{(t)}] \leftarrow -(1-\alpha)\,F^{\theta}[s^{(t)}, a^{(t)}] + \alpha \{ r^{(t+1)} - \gamma \max_a F^{\theta}[s^{(t+1)}, a] \}, \qquad (7)

where −F^θ corresponds to a Q-function approximation in this case and α ∈ [0, 1] is the so-called learning rate. Notice that this expression can be viewed as being of the form

\text{current} \leftarrow (1-\alpha)\,\text{current} + \alpha\,\text{target}, \qquad (8)

where the target value is an approximation of the optimal Q value (expressed in its bootstrapped expansion in terms of the current reward and the discounted Q value of the next timestep) that becomes exact upon convergence to the optimal policy.

Stated differently, this update aims at decreasing the so-called temporal difference error, i.e., the difference between target and current value:

E^{\theta}_{\mathrm{TD}}(t) = r^{(t+1)} - \gamma \max_a F^{\theta}[s^{(t+1)}, a] + F^{\theta}[s^{(t)}, a^{(t)}].

Analogously, we can derive similar expressions for SARSA:

E^{\theta}_{\mathrm{TD}}(t) = r^{(t+1)} - \gamma F^{\theta}[s^{(t+1)}, a^{(t+1)}] + F^{\theta}[s^{(t)}, a^{(t)}]

and PS (see Appendix B):

E^{\theta}_{\mathrm{PS}}(t) = r^{(t+1)} + \gamma_{\mathrm{PS}} F^{\theta}[s^{(t)}, a^{(t)}].


It is then straightforward to derive an update rule for the weights of the RBM [θ^{(t+1)} = θ^{(t)} + Δθ] by performing gradient descent on the squared error [E^θ(t)]^2 [81]:

\Delta\theta = -\frac{\alpha}{2} \nabla_{\theta} \left[E^{\theta}(t)\right]^2 = -\alpha\, E^{\theta}(t)\, \nabla_{\theta} E^{\theta}(t) \approx -\alpha\, E^{\theta}(t)\, \nabla_{\theta} F^{\theta}[s^{(t)}, a^{(t)}]. \qquad (9)

It turns out that, in the case of RBMs, these update values can be computed efficiently since the gradient of their free energy ∇_θ F^θ(v) has a tractable expression:

\frac{\partial F^{\theta}(v)}{\partial w_{ik}} = -v_i \langle h_k \rangle_{P(h|v)}, \qquad \frac{\partial F^{\theta}(v)}{\partial b_i} = -v_i, \qquad \frac{\partial F^{\theta}(v)}{\partial b_k} = -\langle h_k \rangle_{P(h|v)}, \qquad (10)

where ⟨h_k⟩_{P(h|v)} = 1/(1 + exp(−Σ_i w_ik v_i − b_k)). Hence, these formulas allow us to adapt RL update rules to train the free energies of RBMs as merit-function approximators.
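A minimal NumPy sketch (ours) of one update of Eqs. (9) and (10) for a state-action RBM; the TD error follows the Q-learning form of Eq. (7), v and v_next stand for binary encodings of the current and next state-action pairs, and the hidden biases are named c here to distinguish them from the visible biases b:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, b, c):
    return -(v @ b) - np.sum(np.logaddexp(0.0, v @ W + c))

def rl_update(v, v_next, r, W, b, c, alpha=0.01, gamma=0.9):
    """One Q-learning-style step on an RBM free energy, Eqs. (9)-(10)."""
    # temporal-difference error with M = -F [cf. Eq. (7)]
    td_error = r - gamma * free_energy(v_next, W, b, c) + free_energy(v, W, b, c)
    # analytic gradient of F at v, Eq. (10)
    h_mean = sigmoid(v @ W + c)               # <h_k>_{P(h|v)}
    dF_dW, dF_db, dF_dc = -np.outer(v, h_mean), -v, -h_mean
    # gradient-descent step of Eq. (9): delta_theta ~ -alpha * E(t) * grad_theta F
    W -= alpha * td_error * dF_dW
    b -= alpha * td_error * dF_db
    c -= alpha * td_error * dF_dc
    return W, b, c

rng = np.random.default_rng(3)
n_visible, n_hidden = 8, 5
W, b, c = 0.1 * rng.normal(size=(n_visible, n_hidden)), np.zeros(n_visible), np.zeros(n_hidden)
v, v_next = rng.integers(0, 2, (2, n_visible)).astype(float)
W, b, c = rl_update(v, v_next, r=1.0, W=W, b=b, c=c)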
B. From shallow to deep: the deep energy models for RL

The RBM-based model presented in the previous subsection fulfills one of the goals behind generalization in RL: having a parametrized representation of the merit function that allows us to generalize learned behavior to previously unobserved states and actions. A natural consideration that arises from this approximation method is a possible compromise in representational power: a tabular method can represent arbitrary real-valued functions over the state-action space (assuming it is finite), but there is no guarantee, a priori, that the RBM free energy can represent these functions with finitely many hidden units. Fortunately, a result from Le Roux and Bengio [82] shows that RBMs are universal function approximators: a RBM can represent arbitrary probability distributions over its visible units with any given precision, using a finite number of hidden units that depends on this distribution and the desired precision. Unfortunately, these universality results also suggest that certain policies may need a number of hidden units that is linear in the number of state-action pairs to be represented precisely [83]. Similarly, representing arbitrary merit functions using a RBM free energy (Fig. 4) has a comparable cost in the number of hidden units [75]. This leads to a neural network parametrized by a very large number of weights, rendering it difficult to train and likely to get stuck on suboptimal policies for general environments.

1. Deep energy-based network

In contrast to shallow neural networks, deep neural networks are strictly more expressive: the Vapnik-Chervonenkis dimension [84], a measure of function approximators' expressive power [85], of neural networks with ReLU activation functions (similar to the softplus function we consider here, see Appendix B) grows linearly with their number of hidden layers when keeping their total number of weights fixed [86]. This added expressive power can give some intuition on why deep neural networks require only logarithmically many artificial neurons compared to their shallow counterparts to represent a certain general class of functions [87]. Given the greater representational power of deep neural networks, we come back to the connection between the RBM-based approach and feedforward neural networks made in the previous section.

To enable the representation of a richer class of merit functions, we extend the free-energy network in Fig. 4 to a deep architecture by duplicating its hidden layer. The resulting neural network represents the free energy (i.e., the energy after tracing out the stochastic hidden units) of a so-called deep energy model (DEM) (see Fig. 5), first introduced by Ngiam et al. [76]. As pointed out by the authors, this model is different from both deep Boltzmann machines (DBMs) [88] and deep belief networks (DBNs) [89] due to its directed deterministic connections between its first layers, followed by a single undirected stochastic layer. Conceptually, this model transforms state-action pairs given as input into nonlinear features that are then treated as the visible units of a RBM. This construction preserves the tractability of the free energy inherited from the RBM (as opposed to DBMs and DBNs) while representing high-level abstractions of state-action pairs. Hence, DEBNs constitute models with great representational power without the associated drawback of having a large number of weights to train.
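As a concrete illustration (a sketch under our own naming, not the authors' code), a DEBN simply prepends deterministic feature layers to the free-energy network of Fig. 4: the state-action input is mapped through nonlinear layers to features that play the role of the visible units of an RBM, whose free energy is then the negated merit value. The layer sizes and activations below are arbitrary choices for the example.

import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def relu(x):
    return np.maximum(x, 0.0)

def debn_merit(sa, layers, W_top, b_top, c_top):
    """M(s, a) = -F(features(s, a)) for a deep energy-based network.

    sa:     concatenated (binary) encoding of a state-action pair
    layers: list of (W, b) pairs for the deterministic feedforward layers
    W_top, b_top, c_top: parameters of the RBM-like top layer (cf. Fig. 4)
    """
    x = np.asarray(sa, dtype=float)
    for W, b in layers:                        # deterministic feature layers
        x = relu(W @ x + b)
    free_energy = -(b_top @ x) - np.sum(softplus(x @ W_top + c_top))
    return -free_energy                        # merit value M = -F

rng = np.random.default_rng(4)
dims = [12, 32, 32]                            # input and two hidden feature layers
layers = [(0.1 * rng.normal(size=(dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
n_top = 16
W_top, b_top, c_top = 0.1 * rng.normal(size=(dims[-1], n_top)), np.zeros(dims[-1]), np.zeros(n_top)
sa = rng.integers(0, 2, size=dims[0])
print(debn_merit(sa, layers, W_top, b_top, c_top))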
Same as for the RBM, updates of the merit function are performed through gradient descent on the temporal difference error [see Eq. (9)] with respect to the weights of the DEBN. However, due to the deep architecture of the neural network, we have to resort to backpropagation [90] to propagate derivatives of the error through the multiple layers of the DEBN. It can be easily shown that backpropagation of the temporal difference error on the shallow DEBN considered in the previous subsection leads to the same update rule as in Eqs. (9) and (10).

Prior to the DQN of Mnih et al. [42], it was widely believed that nonlinear function approximation using (deep) neural networks was not suitable for learning value functions due to its high instability, which is likely to lead to divergence from the optimal value function [91]. This instability originates from the approximation of the target value function [see Eq. (8)] that leads to a nonstationary temporal difference error and from the important correlations between the training samples generated by the agent.


FIG. 5. The DEBN we consider (left) and its corresponding DEM (right). The DEM is obtained by removing the summation unit at the output of its corresponding DEBN and by replacing its last hidden layer with a layer of stochastic units of the same size. Just as in a RBM, these stochastic units can take binary values depending on the input given by their undirected connections. Biases are not represented for compactness.

These two features, arising in RL yet not in supervised learning, both violate the assumptions behind convergence proofs in (un)supervised learning. Mainly two additions enabled stable learning with DQNs: (1) an experience replay memory storing (recent) samples of learning experience, used to generate decorrelated training data for the neural network; (2) a target network, i.e., a copy of the trained DQN updated at a slower rate, used to supply consistent targets for the temporal difference error during updates of the primary DQN. Naturally, we also rely on these tools to stabilize learning with our DEBNs.

A description of the full learning algorithm is presented in Algorithm 1 of Sec. V.
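The two stabilization tools can be sketched as follows (our simplified illustration, not Algorithm 1 itself; the buffer size, batch size, update period, and the placeholder gradient step are arbitrary): a replay memory that returns decorrelated mini-batches of past transitions, and a target copy of the parameters that is refreshed only every few updates.

import random
from collections import deque
import numpy as np

class ReplayMemory:
    """Stores recent transitions and returns decorrelated mini-batches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):                 # (s, a, r, s')
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

# toy usage: fill the memory and periodically refresh a target copy of the weights
rng = np.random.default_rng(5)
memory = ReplayMemory()
params = {"W": rng.normal(size=(4, 4))}         # stands in for the DEBN weights
target_params = {k: v.copy() for k, v in params.items()}

for step in range(1, 101):
    memory.push((rng.integers(0, 16), rng.integers(0, 4), 0.0, rng.integers(0, 16)))
    batch = memory.sample(32)                   # decorrelated training data
    params["W"] -= 0.01 * rng.normal(size=(4, 4))   # placeholder for a gradient step
    if step % 10 == 0:                          # target network updated at a slower rate
        target_params = {k: v.copy() for k, v in params.items()}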
IV. PERFORMANCE COMPARISON BETWEEN VARIOUS MODELS

Having introduced how energy-based models can be used for function approximation, we now justify their use in RL. To that end, we study the differences in performance of both standard and energy-based function approximation methods for RL when facing environments with large state and action spaces. In a similar fashion to the analysis of Fu et al. [92], we establish several hypotheses regarding the performance of these models with respect to a number of learning aspects appearing in RL. These hypotheses are based on structural considerations along with previous experimental evidence. We then describe and carry out experiments to test our hypotheses.

Let us list our hypotheses, numbered (1) to (4) below. We start by demonstrating that, similarly to a standard RL scenario based on feedforward NNs, the use of function approximation and especially deep models is also beneficial when considering an energy-based approach.

(1) We expect very simple (shallow) energy-based networks to quickly learn in environments with large state spaces but simple target merit functions, while tabular methods would fail to learn as fast. This is due to the ability of (even shallow) neural networks to capture the simple correlations of the merit function, enabling them to generalize on the large input space.
(2) We expect shallow neural networks with a restricted number of hidden units to be unable to learn complex merit functions. A separation in learning performance between shallow and deep energy-based networks should then be noticeable when dealing with more complex task environments.

We test hypotheses (1) and (2) using common benchmarking tasks, introduced in Sec. IV A.

In the second part of this section, we investigate the advantage of energy-based function approximators when facing large action spaces. Our hypotheses stem from previous experimental results showing the improved exploration performance of energy-based policies when facing multimodal reward functions in RL [68,69], as well as results in generative modeling showing the ability of energy-based models to produce robust multimodal representations of large image datasets [77].



FIG. 6. Numerical evidence of the advantage of (deep) function approximation models in a 100 × 100 GridWorld task and the
CartPole-v1 environment. (a),(b) In both plots, the optimal performance (198 and 500 steps, respectively) is indicated by the orange
line. The remaining curves indicate the average performance and standard deviations of 50 agents trained using a PS update rule. (a) In
the 100 × 100 GridWorld task, the performance of tabular agents (red curve) remains close to that of a random walk on the grid while
agents with a shallow neural network (blue curve) reach close to optimal performance. (b) In the Cartpole-v1 environment, agents with
a shallow neural network (red curve) exhibit close to random behavior while neural networks with at least two hidden layers achieve
close to optimal performance. (c) Illustrative depiction of the GridWorld environment. Note that this image is purely illustrative as it
does not reflect the actual size and layout of the grid used in the simulations. (d) A game-screen image of the CartPole-v1 environment.

(3) We postulate that DEBNs are better suited than standard DQNs (see Sec. II C) to learn in RL environments with large action spaces: given the same number of parameters, we expect DEBNs to learn richer representations that allow for better generalization performance when learning merit functions. This should be due to the ability of the DEBN architecture to easily learn complex state-action correlations by having nonlinear hidden layers that act on both states and actions. Since in RL an agent has to rely on its own approximation of the merit function to explore the environment and update itself, we expect the RL learning setting to amplify poor generalization performance in DQN-type architectures. We isolate two learning aspects that are specific to a RL scenario and that we believe characterize these amplifying effects:

(a) Policy sampling. We refer to the first aspect as the policy-sampling property of the training process. It comes from the fact that an agent's policy influences the experienced data that constitutes its training set. Since this policy is derived from the agent's approximation of the merit function, a model that fails to generalize its learned merit function properly will then render the training set suboptimal and hence limit the agent's learning performance.
(b) Reward discounting. The second aspect comes from the update rules of RL. As opposed to a supervised learning scenario, the target values used to train the merit-function approximators are not fixed but themselves computed using an approximation of the merit function. For instance, for SARSA:

E^{\theta}_{\mathrm{TD}} = r + \gamma M^{\theta}(s', a') - M^{\theta}(s, a),

and for PS:

E^{\theta}_{\mathrm{PS}} = r - \gamma_{\mathrm{PS}} M^{\theta}(s, a).

The trained model has then to cope with a noisy and inaccurate target function [93]. These properties are a consequence of statistical fluctuations of sampled trajectories and their associated rewards (main source of error in PS), as well as an imperfect approximation of the merit function associated to the current policy (main source of error in SARSA and Q-learning).

To test our hypothesis (3), we design two experiments that isolate each of these learning aspects. In each experiment, we test the generalization performance of DEBN and DQN agents for increasing dimensions of the action space, for which we expect to see a growing separation in performance between the two models. Our experiments have the specificity that the target merit function to be learned by the agents is fixed and clearly defined, which allows us to measure the error of the agents' approximations, that we refer to as generalization error. We also run a control experiment verifying that the learning aspect is indeed responsible for the observed separation.

(4) Our last hypothesis posits that these gaps in generalization abilities manifest in a separation in learning performance for a sufficiently complex RL environment, e.g., with a high number of good suboptimal policies. We give evidence of this claim in such a RL environment where both effects highlighted above contribute to the learning performance.

The experiments that test hypotheses (1) and (2) are presented in Sec. IV A, while the experiments that test hypotheses (3) and (4) are presented in Sec. IV B.

The Python code used to run these experiments is openly available in a GitHub repository [94] and the hyperparameters used in these experiments are listed in Appendix F.

A. Shallow versus tabular and deep versus shallow

In this subsection, we numerically demonstrate the importance of function approximation and deep models in RL environments with large and continuous state spaces. The results are presented in Fig. 6.

1. GridWorld simulations

In order to demonstrate the advantage of shallow energy-based networks over tabular methods, we compare their performance in the GridWorld benchmarking task [95] [see Fig. 6(c)]. In this task, the agent navigates through a two-dimensional array of cells by choosing, at each timestep of interaction, one of the four main cardinal directions to move from one cell to the next. The training is divided into trials. At the beginning of each trial, the position of the agent is initialized at a fixed starting cell (0, 0) and a reward is placed in a fixed goal cell (n, n), where n is the grid size. A trial terminates if the agent reaches the goal cell or if the length of the trial exceeds 20 000 timesteps. In the former case, the agent receives a reward of 1. In the latter, the agent receives a reward of −1. For the simulations presented here, we choose a 100 × 100 grid with closed boundary conditions and no obstacles such that the state space is made large while maintaining a simple optimal policy, close to being state independent (going up or right with equal probability, except at the boundaries). This last property brings a very helpful but yet very simple way to generalize the learned merit function and policy on the large state space. In Fig. 6(a), we demonstrate that agents using a shallow neural-network architecture significantly outperform their tabular counterparts in this task. This improved performance attests to the generalization capabilities provided by the neural network. The tabular methods are unable to detect and exploit the similarities between different cells.
regime.
2. CartPole simulations Let us now introduce in more detail the RL-specific
We now show, by means of an example, that DEBNs aspects that we expect to cause a separation between the
may provide advantages over shallow architectures in two architectures and the experiments we perform to test
more complex task environments. More specifically, we this claim.
consider the OpenAI Gym CartPole-v1 environment [96],
which features a continuous state space. In CartPole, the
goal is to balance an inverted pole pendulum mounted on 1. Policy sampling
a movable cart by applying to the latter, at each timestep The first aspect comes from the policy-sampling prop-
of interaction, a unit of force to the left or right [see erty of the training process: at each interaction step, an
Fig. 6(d)]. A trial terminates in one of three situations: (a) agent samples one action a from a policy specified by its
the angle between the vertical line above the pivot point current approximation M θ (s, . . .) of the merit function, and
and the pole exceeds 12◦ , (b) the center of mass of the updates this approximation according to a loss that is (in
cart is at a horizontal position outside the range [−2.4, 2.4], general) independent of the values M θ (s, a ), M (s, a ) for


FIG. 7. The policy-sampling experiment. At each learning step, a state is uniformly sampled from the state space (of size 256). Its
corresponding binary encoding is perceived by the agent, which samples an action from a Boltzmann policy computed using its current
approximation of the merit function (initially random). In the policy-sampling experiment, the target value associated to this sampled
state-action pair is compared to the approximation of the neural network, resulting in the loss used to train its weights. In the analog
multidimensional regression problem, the entire row of target values associated to that state is compared to the values approximated by
the network, each contributing to the loss. We additionally use mini batches of states of size 10 to compute average losses and improve
convergence rates. The DEBNs also receive binary encodings of the actions.

1. Policy sampling

The first aspect comes from the policy-sampling property of the training process: at each interaction step, an agent samples one action a from a policy specified by its current approximation M^θ(s, ·) of the merit function, and updates this approximation according to a loss that is (in general) independent of the values M^θ(s, a'), M(s, a') for different actions a'. As opposed to a supervised (multidimensional) regression task, where, for each sampled state s, the agent could compute its loss on the entire vector (M^θ(s, a_1), ..., M^θ(s, a_{|A|})), the policy-sampling aspect can lead to masking effects on sampled target values. In particular, when the current approximation incorrectly assigns large values to certain actions, this approximation error skews the sampling distribution in an inopportune way and prevents learning the merit function faithfully on the entire state-action space. Since sampling from the correct data distribution is already an imperative to get good generalization error in regression tasks, it should as well be important in the more general setting of policy sampling. We expect these masking effects to be particularly relevant when the target merit function is multimodal, as this multimodality (in general) adds complexity to the state-action correlations to be learned. Intuitively, since DEBNs take some (compact) representation of individual actions as input, they are able to construct nonlinear features of both states and actions. Hence, this makes DEBNs more likely to learn complex state-action correlations and masks, in a way that is not possible for a network that can only construct state features in its hidden layers.

To investigate the magnitude of this effect, we compare the learning performance of these two models in the policy-sampling learning task described in Fig. 7. Interestingly, we observe that the DQNs learn a very noisy approximation of the target function, and even completely fail to learn one of the five modes of the target function. In order to give more quantitative results, we test the generalization performance of the two models on the same task with different levels of discretization of the action space. As we can see in Fig. 8(a), the separation in testing loss appears to increase linearly with the size of the action space. Here, one could argue that this separation is solely caused by the bottleneck effect, as previously discussed. However, we show that this separation indeed originates from the policy-sampling aspect of training by running a control experiment on the analog supervised learning task, where we indeed observe that both architectures have the same performance up to 256 actions (after which the bottleneck effect might be responsible for part of the separation).
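A heavily simplified sketch (ours, not the setup of Fig. 7; the table sizes, learning rate, and the tabular stand-in for the network are assumptions) of one learning step in the policy-sampling setting, contrasted with the supervised regression control: in the former the loss touches only the single action sampled from the current Boltzmann approximation, in the latter the whole row of target values contributes.

import numpy as np

rng = np.random.default_rng(8)
num_states, num_actions = 256, 32
target_merit = rng.normal(size=(num_states, num_actions))    # fixed target to be learned
approx_merit = np.zeros((num_states, num_actions))           # stands in for the network

def policy_sampling_step(lr=0.1, beta=1.0):
    """Loss computed only on the action sampled from the current approximation."""
    s = rng.integers(num_states)
    logits = beta * approx_merit[s]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(num_actions, p=probs)                      # masking: one action only
    approx_merit[s, a] -= lr * (approx_merit[s, a] - target_merit[s, a])

def regression_step(lr=0.1):
    """Control experiment: every target value of the sampled state contributes."""
    s = rng.integers(num_states)
    approx_merit[s] -= lr * (approx_merit[s] - target_merit[s])

for _ in range(5000):
    policy_sampling_step()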
2. Discounted rewards

The second aspect of RL for which we expect a separation comes from the reward discounting introduced by RL update rules. For both projective simulation and value-based methods, merit values do not correspond to the immediate rewards of state-action pairs but also include estimates of future rewards discounted by a discount (or glow) factor [see bootstrapped expression of Eq. (8)]. This is an intrinsic property of RL update rules as they all aim at optimizing long-term rewards. For VBMs, these estimates of future rewards are given by a sample of the immediate reward and the merit value of the next experienced state as approximated by the model; while, for PS, these correspond to the rewards collected during one sampled trajectory in the environment. For stochastic policies and/or environments, the sampled terms lead in general to statistical fluctuations in the target merit values on which the agents compute their loss. As for the approximated terms, errors in the approximation of the merit function lead to noise in the target values. We refer to both these effects as fluctuations of the merit values. To study the influence of this RL aspect in isolation from other sources of fluctuations, we consider the following task: an agent is given access to sample rewards experienced by a fixed stochastic policy (also called off policy) in a toy environment where states are randomly sampled at each interaction step, and needs to learn its discounted merit function.



FIG. 8. Separation in testing loss between DEBN and DQN agents in the policy-sampling and reward-discounting experiments.
(a),(b) In both plots, each bar indicates the average testing loss (i.e., generalization error) of 20 agents in their respective tasks. The
number of weights of both neural networks is kept constant and equal in each experiment. Hatched bars are associated to the control
experiments where the learning setting does not have the tested property. (a) In the policy-sampling experiment, we test the ability
of both architectures to faithfully represent a multimodal merit function (depicted in Fig. 7) when training samples are generated
by the current approximation of said function. The testing loss corresponds to the average smoothed l1 loss over all merit values
associated to 1000 randomly sampled states after 5000 learning steps. (b) In the reward-discounting experiment, we test the ability of
both architectures to cope with the statistical fluctuation on their target merit values brought by a SARSA update rule. The testing loss
corresponds to the average l2 loss over all merit values associated to 100 randomly sampled states after 2000 learning steps (20 000
total sampled transitions and an update period of 10).

The constraints of this task effectively cancel the policy-sampling aspect considered in the previous experiment as well as temporal fluctuations induced by the stochastic policy at hand. The setting for which we observe the highest separation between the two models is for an off policy very close to optimal (performing a rewarded action with probability 0.99 and a random action otherwise) and with a discount factor of 0.9. Again, to assess the influence of reward discounting on this separation, we run a control experiment where the discount factor is set to 0. The results of these simulations are presented in Fig. 8(b). Note that we observe this separation despite the use of the learning mechanisms that help counter the statistical fluctuation of target values: notably, the experience replay memory and, more importantly, the target network (see Sec. III B).

3. A complete RL environment

In the previous experiments, we study the separations between DEBNs and DQNs in terms of their generalization error when learning merit functions for restricted RL environments. We now give evidence of a separation in learning performance (i.e., average collected rewards) in a complete RL environment. Here, agents learn to solve a toy GridWorld task with a special transition function by directly interacting with it (policy sampling) and are subject to the statistical fluctuations induced by a discount factor. This task environment has the same description in terms of states and actions as the one used for the discounted-reward experiment, but differs in its transition function such that it adds more structure to the task. The description goes as follows: an agent is situated in a one-dimensional GridWorld with N cells, enumerated 0 to N − 1, circular boundary conditions and a goal state at position ⌊N/2⌋. At each interaction step, the agent is at a given position n and can do one of (⌊N/2⌋ + 1)² actions. Each action is described by two subactions: moving i steps to the left and j steps to the right for i, j in 0, ..., ⌊N/2⌋. Call n' the position n + j − i mod N. From the actions for which n' corresponds to the goal position ⌊N/2⌋, only the shortest (i.e., smallest i + j) is rewarded with a reward of 1. All other actions are not rewarded. Now for the transition function: if n' = ⌊N/2⌋, the agent is moved to the position n + 1 (or n + 2 if that happens to be the goal position), otherwise it is moved to the position n'. This choice of transition function, while quite artificial, leads to the interesting property that this task allows for several suboptimal policies, e.g., the agent runs into a loop of l ∈ {2, ..., N − 2} positions with only one nonrewarded transition, achieving an average reward of 1 − 1/l ≥ 0.5. The only optimal policy, on the other hand, runs through a loop of length N − 1 and achieves an average reward of 1. This makes it a very challenging task if one seeks the optimal policy, and in part explains why the separation in performance between the two architectures appears small.
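The reward and transition rules just described can be summarized in a short sketch (ours; an action is the pair (i, j), moving i steps left and j steps right, and we read the description as taking positions modulo N):

def circular_gridworld_step(n, N, action):
    """One interaction step of the toy circular GridWorld.

    n: current position, N: number of cells, action: pair (i, j) with 0 <= i, j <= N // 2.
    Returns (reward, next_position).
    """
    goal = N // 2
    i, j = action
    n_prime = (n + j - i) % N
    if n_prime == goal:
        # only the shortest action reaching the goal from n earns the reward
        shortest = min(ii + jj
                       for ii in range(goal + 1) for jj in range(goal + 1)
                       if (n + jj - ii) % N == goal)
        reward = 1.0 if i + j == shortest else 0.0
        next_pos = n + 1 if (n + 1) % N != goal else n + 2
        return reward, next_pos % N
    return 0.0, n_prime

print(circular_gridworld_step(n=3, N=15, action=(0, 4)))   # goal at 7: 4 steps to the right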


FIG. 9. Separation in average rewards between DEBN and DQN agents in a RL task. Each dot on the right plot indicates the average
reward of 20 agents over 100 trials after training for 1900 trials of length 100 interactions. The left plot shows the evolution of this
average performance over the course of training for the toy circular-GridWorld task with 225 actions. Very simple suboptimal policies
can achieve an average reward of 0.5, represented by the red line. The optimal policy achieves an average reward of 1, indicated by
the green line.

(see Fig. 9) an increasing separation in average rewards next section applies to problems with large action spaces
with respect to the number of actions in the environment. and hence can not be numerically demonstrated as classical
This illustrates how the gaps in generalization performance simulations are intractable.
observed for the RL-specific learning aspects discussed
above translate into a separation in learning performance V. HYBRID DEEP ENERGY-BASED RL WITH
when these aspects act in concert. QUANTUM ENHANCEMENTS
So far, we have shown that one instance of energy-based
4. Comments on the experiments models, the DEBNs, can provide a learning advantage over
Due to the simplicity of the chosen tasks, the separa- standard DQNs in complex RL environments. However,
tions observed in our experiments with DEBNs and DQNs this learning advantage comes at a cost: the inefficiency
may get less significant in different learning regimes. of sampling from energy-based policies, needed to act on
Similarly to tabular methods being able to learn optimal the environment and train the RL agent. It is easy to see
policies given arbitrarily long learning time, we expect the that exact sampling is intractable, since it requires com-
observed separations to get smaller when looking at longer puting the merit values M θ (s, a) for all possible actions
learning times, or, for instance, when increasing the total a given a state s to evaluate the expression of Eq. (1),
number of weights of the networks and hyperparameters which costs O(|A|) feedforward evaluations in the case
such as the update period of the target networks. How- of DEBNs. Moreover, this hardness of sampling even
ever, throughout our exploration of relevant regimes, we extends to approximate sampling: it is actually an NP-
did not observe any inversion of the performance of both hard problem to generate samples from a distribution that
networks, indicating that a large separation might as well is provably close in total variation distance from a general
be present in broader regimes for more complex environ- RBM distribution [78]. We address this computational bot-
ments. We use a PS update rule for the simulations on the tleneck by applying quantum algorithms for approximate
benchmarking tasks and a SARSA update rule for the sepa- sampling from Gibbs distributions, that is algorithms that
ration results in the reward-discounting and full-RL exper- can offer quantum speed ups to heuristic sampling meth-
iment. We however believe that the same experiments with ods. We investigate the potential of using both (long-term)
interchanged update rules exhibit a similar behavior, pos- fault-tolerant algorithms with provable quadratic speed ups
sibly with different separations. This claim is supported by [97–99], as well as near-term approximate optimization
an extensive numerical study that showed that the tabular approaches [100–102].
versions of PS and VBMs behave similarly in these types We have focused in our simulations on DEBNs rather
of benchmarking tasks [95] and additional numerical simu- than other energy-based models due to their interesting
lations that were not included in this paper for conciseness. properties: they offer deep NN-based representations of
In order not to penalize the DEBN agents with approxima- the merit function while allowing efficient computation of
tion errors in sampling, we use exact sampling (see next this function and its gradient (through a forward and back-
section) from the policy of the agents in the experiments ward evaluation of the DEBN). A drawback of DEBNs
above. Note that the quantum speed up presented in the is that, in order to be evaluated in quantum algorithms,

010328-13
SOFIENE JERBI et al. PRX QUANTUM 2, 010328 (2021)

FIG. 10. Pseudocode of the algorithm used to train energy-based agents in RL environments. Possible quantum enhancements are
highlighted by Theorems 1, 2, and 3.

one needs to carry out the feedforward computation coher- In this section, we introduce a hybrid quantum-classical
ently, which is out of reach of near-term quantum devices. scheme for training deep energy-based models (Algorithm
Hence, this encourages us to additionally consider differ- 1 in Fig. 10). We list several classical and quantum
ent deep energy-based models such as deep Boltzmann models that can be used in this hybrid scheme, and present
machines as well as their quantum extensions (QBMs) that quantum subroutines to speed up sampling and the evalua-
have the advantage of being easily simulatable. In view of tion of the approximate merit function and its gradient.
their similar properties, we expect these models to be as
robust to the learning aspects that hinder DQN-type mod-
els. With DBMs and QBMs however, we face the issue that A. Description of the algorithm
the evaluation of the merit function (i.e., the negative free In Algorithm 1, we give a general procedure to train RL
energy of the BM after tracing out its hidden units), along agents with energy-based merit-function approximators.
with its gradient, is intractable for a large number of hid- This procedure applies to both VBM and PS update rules
den layers and units. The reason behind this intractability and takes into account common mechanisms to stabilize
is similar to the one behind the hardness of sampling from learning, notably a replay memory and a target network
RBMs or DEBN policies: in both cases, one needs to com- (see Sec. III B). These mechanisms, along with the interac-
pute the free energy of the model after tracing out mutually tion with the environment, are kept classical. Throughout
dependent (i.e., connected) units, which is a hard task. training though, the agent needs to perform three subrou-
However, similarly to the task of sampling from RBMs tines that are amenable to quantum speed ups:
and DEBN policies, approximate sampling algorithms can
be used to estimate the free energy of DBMs and its 1. sampling from a distribution specified by the current
gradient. approximation of the merit function;

010328-14
QUANTUM ENHANCEMENTS FOR DEEP REINFORCEMENT... PRX QUANTUM 2, 010328 (2021)

TABLE I. Applicability of different sampling algorithms to several energy-based models. F θ  corresponds to the maximum free
energy of the model, while H θ  is its maximum clamped energy, i.e., the energy after tracing out the hidden units or simply fixing
the input units, respectively. K counts the number of hidden units of the model while A counts its number of action units. β is the
inverse temperature of the sampled Gibbs distribution and δ the spectral gap of the Gibbs sampling Markov chain using for sampling
(when applicable). The complexities of exact sampling come from basic linear algebra techniques, while the complexities of classical
Gibbs sampling are due to lower bound results on mixing times of Markov chains [103] (see also Appendix A). The complexities for
the quantum algorithms are proven in Theorem 1.

Sampling → Classical Near/Mid-term quantum Fault-tolerant quantum


Variational Quantum
Exact Gibbs Quantum Gibbs state simulated Hamiltonian
Model ↓ sampling sampling annealers preparation annealing simulation
 √ 
  βF θ  
Classical DEBN O(2A ) O 1
δ
✗ ✗ O δ
O 2A βF θ 
 √ 
  βH θ  
DBM O(2AK ) O 1
δ
(heuristic) (heuristic) O δ
O 2AK βH θ 
 θ 2
 √ 
Quantum QBM O(23AK ) ✗ ✗ (heuristic) 
O β 2 H
√  
O 2AK βH θ 
δ

2. estimating the approximate merit values associated B. Examples of suitable energy-based models
to experienced states and sampled actions; Before discussing our quantum subroutines, we start by
3. estimating the gradient of the approximate merit describing the energy-based models studied in this section.
function for experienced state-action pairs. Having already introduced in Sec. III restricted Boltzmann
machines, deep energy models and their associated free-
So far, we have been considering DEBNs, for which sub-
energy DEBNs, we now focus on DBMs and QBMs. As
routines 2 and 3 can be efficiently performed classically
is common in statistical and quantum physics, we switch
and hence do not require quantum speed ups. For differ-
from energy functions to a description in terms of Hamil-
ent energy-based models however, this can no longer be
tonians, involving here Pauli σ x and σ z matrices. Similarly
the case as they generally involve computing expectations
to Sec. III, we define our merit-function approximators
values with respect to intractable probability distributions.
M θ (s, a) using the negative free energy −F θ (s, a), i.e., the
Nonetheless, we show in Sec. V D that these two subrou-
energy after tracing out hidden units of these models.
tines can be performed using the same sampling algorithms
of subroutine 1, presented in Sec. V C. To see this, note that
the free energy of an energy-based model takes the form
1. Deep Boltzmann machines
DBMs [88,104] are generalizations of RBMs with addi-
F(v) = E(v, h)P(h|v) + S(h|v), tional hidden layers. They are described by a diagonal
Hamiltonian of the form:

that is the sum of the expected energy and Shannon entropy  


θ
HDBM =− wi,k1 σiz σkz1 − wj ,k1 σjz σkz1
(or von Neumann entropy in the quantum case) of the
i,k1 j ,k1
model under the conditional Gibbs distribution P(h|v).  
Both these terms can be estimated by sampling from this − wkη ,kη+1 σkzη σkzη+1 − bl σlz (11)
same distribution. As for the gradient of the free energy, η,kη ,kη+1 l
we show that for the models we consider, these gradients
take the form of expected values of the model parameters
under the distribution P(h|v). parametrized by a real-valued vector θ = (wi,k1 , wj ,k1 ,
Essentially, at the core of these three subroutines is a wkη ,kη+1 , bl )i,j ,kη ,l and where we adopt the notation that
sampling problem. As explained in more detail in Sec. V C, assigns the indices i, j , l to state, action and all units,
classical algorithms that solve this problem should gen- respectively, and the indices kη to hidden units of the ηth
erate samples from a distribution that is close to a target hidden layer. Note that general Boltzmann machines with
Gibbs distribution of a classical or quantum Hamiltonian, unrestricted connections (UBM) can also be mapped to
θ
while, in the quantum case, the analog goal is to prepare Hamiltonians of the form HDBM . One way of doing this is
quantum Gibbs states that encode such distributions. We by simulating the intralayer connections of a UBM using
focus on the complexities of such algorithms to quantify new connections to additional hidden units (that can be
the runtime of our subroutines. placed in the next hidden layer) [105].

010328-15
SOFIENE JERBI et al. PRX QUANTUM 2, 010328 (2021)

The DBM free energy however does not have a neural- This free energy again does not take a neural-network
network formulation, and is instead expressed as formulation and has the following expression:
  θ θ
 θ FQBM (s, a) = Tr[ρs,a HQBM ] + Tr[ρs,a log(ρs,a )], (14)
θ
FDBM (s, a) = −log e−H (s,a,h)
 
h −H θ −H θ
  where ρs,a = s,a e QBM s,a /Tr s,a e QBM is the QBM
=− wi,k1 si hk1  − wj ,k1 aj hk1 
Gibbs state after a projective measurement s,a .
i,k1 j ,k1
 
− wkη ,kη+1 hkη hkη+1  − bk hk  3. Access models
η,kη ,kη+1 k In the case of RBMs and DEMs, we take advantage of
  the computability of their free energies to avoid using their
− bi si − bj aj
full Hamiltonian, and consider instead the Hamiltonians
i j
 defined by their associated DEBNs.
+ P(h|s, a)log[P(h|s, a)], (12) All the Hamiltonians considered here allow sparse
h access [106], meaning that we can efficiently (i.e., in
polynomial time) construct the two unitary oracles:
where hk  is the expected value of the hidden unit k under
 
the conditional distribution P(h|s, a). ÔH |j  |k |z = |j  |k z ⊕ Hjk ,

2. Semitransverse quantum Boltzmann machines ÔF |j  |l = |j  |f (j , l) .


QBMs [13,33] are a generalization of Boltzmann
machines with additional transverse-field terms σ x in their For a d-sparse Hamiltonian H , ÔH accepts a row index
Hamiltonians that allow for eigenvectors that are superpo- j and column index k and returns Hjk in a binary for-
sitions of computational basis states. The resulting nondi- mat, while ÔF accepts a row index j and a number l ∈ [d]
agonal Hamiltonians are of the form: and computes in-place the column index f (j , l) of the lth
nonzero element in row j .
 
θ
HQBM =− wi,k1 σiz σkz1 − wj ,k1 σjz σkz1 While sparse access is sufficient to run all the quantum
i,k1 j ,k1
subroutines described below, its implementation requires
  resources far beyond those NISQ devices can offer. This is
− wkη ,kη+1 σkzη σkzη+1 − bl σlz the case of DEBN (and DEM) Hamiltonians for instance,
η,kη ,kη+1 l where the implementation of ÔH requires the computation
 of the entire neural-network function coherently, which
− k σk
x
(13) is out of reach of near-term implementations. DBMs and
k
QBMs however also naturally satisfy access in the form of
linear combinations of (local) unitaries H = Nn=1 αn Un ,
parametrized by a real-valued vector θ = (wi,k1 , wj ,k1 ,
where N scales at most quadratically with the number
wkη ,kη+1 , bl , k )i,j ,kη ,l,k and where we adopt the notation
of units of the model, which eases up significantly their
that assigns the indices k to all hidden units. We choose
implementation.
to apply only the transverse-field terms k σkx on hid-
den units (hence the qualifier semitransverse). This choice
θ C. Quantum algorithms for Gibbs state preparation
allows expression of the Hamiltonian in the form HQBM =

v,hv λv,hv |v, h v  v, h v |, where |v = |s, a is a given con- At the crux of training energy-based models is the ability
figuration of the state and action units (i.e., a computa- to sample from Gibbs distributions of the form
tional basis state) but |hv  is allowed to be an arbitrary
superposition of computational basis states (see Appendix e−βH (x)
pβ (x) = (15)
B). In turn, this  leads to a straightforward derivation Trx [e−βH ]
θ
of v HQBM v = hv λv,hv |v, hv  v, hv | for v = s,a =
|s, a s, a| ⊗ Ih a projector on a given configuration v = of a Hamiltonian H specifying our learning model, where
(s, a), which helps evaluate the QBM’s free energy and its we assume {x} to be a subset of the eigenvectors of H ,
gradient via sampling (see Sec. V D). These considerations for simplicity [107]. While the energy H (x) of a single
are similar to the ones of Wiebe et al. [35] (albeit visible configuration can be  efficiently computed, the partition
units are the ones allowed to be in superposition in their function Trx [e−βH ] = x e−βH (x) involves the energies of
case), who used QBMs for generative training of quantum combinatorially many configurations, making it hard to
states. approximate in general [78].

010328-16
QUANTUM ENHANCEMENTS FOR DEEP REINFORCEMENT... PRX QUANTUM 2, 010328 (2021)

Classically, this distribution can be computed explicitly Alternatively, for DBMs and QBMs, one can opt for a
by evaluating the energy of all the configurations x we heuristic variational approach, where the cost of evaluat-
want to sample over and normalizing the resulting vector ing the objective function is in
of energies. This is the approach we take in the numer-  
ical simulations of Sec. IV. Alternatively, one can use O M θ /ε + 1/(p 2 βε)
min
random walk algorithms such as Gibbs sampling to gener-
ate samples from an approximation of this distribution (for gates and preparations of the parametrized quantum cir-
completeness, we describe these random walk algorithms cuit, where ε is the precision of the estimate and pmin an
in Appendix A). approximation parameter.
In a quantum setting, one is interested in the similar goal
of preparing approximately a state of the form [108] A recurring quantity in the subroutines of Theorem 1 is
a bound on the model Hamiltonian’s norm M θ , that one
needs to estimate to be able to apply them. In order to do
  √
ψβ =  1 e−βH (x) |x , (16) so, one has several possibilities:
Trx [e−βH ] x
(a) Use M θ  = θ 1 , the l1 norm of all the weights
that parametrize a chosen model, as this quantity
called a purified Gibbs state of our model Hamiltonian. also bounds the Hamiltonian’s maximum eigen-
This is because measuring this state in the basis {x}x value.
effectively returns a sample from the Gibbs distribution, (b) Use the maximum finding algorithm to get an
Eq. (15). Interestingly, the exact and Gibbs sampling approximate value [99,109] of M θ  [even √ though

approaches mentioned above both have their quantum ana- this algorithm has a complexity in O  2n for
log, each providing a quadratic speed up in sampling
over the classical algorithms (up to overheads in some n = A or AK].
parameters). However, these quantum algorithms utilize (c) Assume that the approximated merit function is
many gates and qubits and will likely require fault-tolerant close to the true function and use a bound on
implementations to be used. To consider approaches that the latter, e.g., |R|
1−γ
max
in the case of VBMs, where
may be more amenable to NISQ constraints, we also study |Rmax | is the largest obtainable reward and γ the
a variational approach to Gibbs state preparation that trades environment’s discount factor.
off guarantees of speed up and convergence with less
The proof of Theorem 1 is a combination of known
resource-consuming primitives.
results along with an additional lemma derived from
We summarize the complexities of these sampling algo-
previous works. In the following, we summarize these
rithms in the following informal theorem, before providing
results, namely fault-tolerant quantum algorithms based
their short descriptions.
on Hamiltonian simulation and quantum simulated anneal-
ing to prepare purified Gibbs states, followed by varia-
Theorem 1 (Purified Gibbs state preparation (informal)). tional approaches to Gibbs state preparation suitable for
Given sparse access to one of the model Hamiltonians near-term quantum devices.
listed above and a bound M θ  on its free energy or
clamped energy (i.e., the energy after tracing out the hid- 1. Gibbs state preparation using simulation
den units or simply fixing the input units, respectively), we The approach of van Apeldoorn et al. [99] consists in
can prepare a purified Gibbs state at inverse temperature using Hamiltonian simulation techniques [110] to imple-
β in either ment an approximation of the operation f (H ) = e−H .
  Since this operation is not unitary, one needs to rely on
βM θ
 so-called block encodings to apply it, that is, construct
TGibbs = O
δ a unitary Ũ such that (0| ⊗ I )Ũ(|0 ⊗ I ) − e−H  ≤ ε.
This construction has a complexity in  O (H d) for a
quantum-walk steps, where δ is a lower bound on the spec- d-sparse Hamiltonian. One can then apply √ Ũon the A
tral gaps of the Markov chains associated to this quantum register of the
 maximally entangled
 state 2−n i |iA |iB
walk; or in and use O  2n /Tr[e−H ] rounds of amplitude amplifi-
√  cation [111] to amplify the relevant (|0 ⊗ I ) subspace.

TGibbs  2n βM θ 
=O This results in a good approximation of the desired puri-
fied Gibbs state. To get rid of the 1/Tr[e−H ] term in the
gates and queries to the Hamiltonian, where n is the complexity, the authors additionally apply a linear trans-
number of action (and hidden) units of the model. formation on H , that we defer, along with a detailed

010328-17
SOFIENE JERBI et al. PRX QUANTUM 2, 010328 (2021)

restatement of their result, to Appendix C. Note that,  


Õ{ log Tr[e−βH ]−1 } ≤ Õ( β M θ ), leading to a total
since this approach is based on Hamiltonian simulation  
techniques, it is applicable to both classical and quantum preparation complexity in Õ( δ −1 β M θ ).
(i.e., nondiagonal in the computational basis) Hamiltonians
Quantum Hamiltonians. In the case of a quantum Hamil-
without the need for special considerations.
tonian (e.g., for QBMs), while we can still construct sparse
access to this Hamiltonian, its eigenstates are not compu-
2. Quantum simulated annealing (QSA)
tational basis states in general but instead superpositions
In the case of quantum simulated annealing, classical that are a priori unknown. Since random walk algorithms
and quantum Hamiltonian require a separate treatment, need access to the eigenstates of a problem Hamiltonian
that we describe below. We start by an overview of the in order to navigate its energy landscape (and eventu-
classical-Hamiltonian case and then explain how its meth- ally sample from its associated Gibbs distribution), this
ods can be generalized to the quantum-Hamiltonian case. prevents us from constructing the necessary Gibbs sam-
Classical Hamiltonians. The idea behind the QSA pling Markov chains and their associated quantum-walk
approach is to quantize Gibbs sampling random walks to operators [116]. A way to circumvent this issue was intro-
achieve a quadratic speed up in runtime. Random walk duced by Temme et al. [117], then further generalized to
algorithms have a runtime complexity that depends on the work with quantum-walk operators and QSA by Yung and
spectral gap δ of their associated Markov chains (MCs). Aspuru-Guzik [98]. While these approaches use different
This runtime, also called mixing time of the MC, is in constructions, both rely at their core on Hamiltonian sim-
Õ(δ −1 ) (see Appendix A for more details). An example of ulation and quantum phase estimation to get access to the
such MCs is specified by the Gibbs sampling (or more gen- eigenvalues of a quantum Hamiltonian without knowledge
erally, Metropolis-Hastings) algorithm. Pictorially, these of the structure of its eigenvectors. Because we are inter-
MCs define the transition probabilities of a random walk ested here in quantum-walk approaches, we focus here
over the eigenstates of the Hamiltonian that is governed on the method of Yung and Aspuru-Guzik. This method,
by their respective eigenvalues, such that, after the MC as presented, does not use adaptive annealing schedules
mixing time, eigenstates are sampled according to their and relies on projective measurements instead of quantum
Gibbs distribution. A quantum analog of classical random amplitude amplification to adiabatically transform inter-
walks are Szegedy-type quantum walks [112,113]. They mediary states, analogously to the approach of Somma
generalize a MC P to a unitary W(P) called a Szegedy et al. [118].
 For thesereasons, the method has a complexity

walk operator: by taking P to be a Gibbs sampling MC 
in O β M θ 2 / δ , quartically worse in β and M θ 
2
whose stationary distribution is the Gibbs distribution  pβ
than with diagonal Hamiltonians. We prove however, in
of Eq. (15), W(P) has the quantum state ψβ of Eq. (16)
the following lemma, that the advances in QSA for clas-
as its eigenvalue-1 eigenvector. Using a combination of
sical Hamiltonians presented above are also applicable to
phase estimation [58] and amplitude amplification [111],
quantum Hamiltonians, leading to the same complexities.
one can transform
 an initial state  |φ
 √into the desired
state ψβ using a total of Õ[(| ψβ φ | δ) −1 ] applica-
tions of W(P) [114]. Direct preparation of ψβ from an Lemma 1. Given sparse access to a quantum Hamiltonian
arbitrary
√ state |φ, e.g., the uniform superposition |u = H and a bound M θ  on its free energy or clamped energy,

2−n  i |i,
 can however be intractable since the over-
we can prepare via adaptive quantum simulated anneal-
lap | uπβ |2 can be as small as 2−n (or even worse for ing a quantum state ε close to the coherent encoding of its
a different |φ) [115]. To get around this issue, one can Gibbs distribution at inverse temperature β
instead use these same  tools   to  prepare
 several
 inter-
mediary states |u = ψβ0 , ψβ1 , . . . , ψβl that follow 1 √
 e−βϕ|H |ϕ |ϕ |ϕ̃ , (17)
an annealing schedule on the inverse temperature β of Tr[e−βH ] ϕ
the Gibbs distribution. Intuitively, if the overlap between
these intermediate states is made large by construction,
where |ϕ are eigenstates of H and |ϕ̃ their complex
then the whole transformation should be efficient. For-
conjugates; using
malizing this idea, Wocjan et al. have shown [114] that
given a schedule β0 = 0, β1 , . . . , βl = β and Gibbs sam-  
pling Markov chains Pβ0 , Pβ1 , . . . , Pβl with stationary  βM θ 
O
distributions pβ0, pβ1 , . . . , pβl that fulfill the slow-varying δ
condition | ψβi ψβi+1 |2 ≥ c, preparing an ε approxima-
  √
tion of ψβ has complexity O[(l/c) δ −1 log2 (l/ε)]. Using quantum-walk steps, where δ is a lower bound on the spec-
an adaptive annealing schedule, Harrow and Wei [97] tral gaps of the Markov chains associated to this quantum
were able to provide a construction with length l = walk.

010328-18
QUANTUM ENHANCEMENTS FOR DEEP REINFORCEMENT... PRX QUANTUM 2, 010328 (2021)

Proof. The proof of this lemma goes in two steps, each with respect to the Hamiltonian H , one can use this free
extending the construction of Yung and Aspuru-Guzik [98] energy as an objective function to be minimized by |ψθ .
to the algorithmic improvements of Wocjan et al. [114] and The existing methods in variational Gibbs state preparation
Harrow and Wei [97] respectively: [100–102] vary in three aspects.
Step 1. We first note that the generalized Szegedy walk
operators defined by Yung and Aspuru-Guzik for quan- (a) The choice of Ansatz: Chowdhury et al. [100] use
tum Hamiltonians satisfy the relevant properties that are an Ansatz derived from Trotterized adiabatic state
required for the QSA approach of Wocjan et al. (Theorem preparation, while Wu and Hsieh [101] opt for an
2 in Ref. [114]) that is based on amplitude amplifica- Ansatz motivated by the quantum approximate opti-
tion instead of projective measurements. These properties mization algorithm (QAOA), and Wang et al. [102]
are namely having a purified Gibbs state of the form choose a so-called hardware-efficient Ansatz.
√ (17) as eigenvalue-1 eigenstate and a phase gap  ≥
Eq. (b) The subroutines to evaluate the Ansatz’ free energy
2 δ. Note that, as shown by Yung and Aspuru-Guzik, the [Eq. (18)]: the objective function is composed of
eigenbasis of the quantum Hamiltonian is not needed to the average energy of ρ with respect to H and
construct the generalized walk operators and input states the von Neumann entropy of ρ. The latter is
of these QSA algorithms. Additionally, by working in the more term challenging to evaluate. While Wu
the eigenbasis of the quantum Hamiltonian instead of the and Hsieh consider a decomposition into several
computational basis, the error analysis in Lemma 3 and Renyi entropies, measured using swap tests [119];
Corollary 3 of Ref. [114] to derive the length and overlaps Wang et al. and Chowdhury et al. use a trunca-
of the annealing sequence is straightforwardly applica- tion of its Taylor series, where the terms are com-
ble to quantum Hamiltonians, leading to a preparation puted via (high-order) swap tests and amplitude

complexity in O  βM θ / δ . estimation [58], respectively.
(c) The optimization method: here one can opt for a
Step 2. By tracing the construction steps of Harrow and
gradient-descent approach, where several evalua-
Wei (Theorem 4 of Ref. [97]), we can see that the proof
tions of the objective function and finite-difference
of existence of an annealing (or cooling) schedule, which
formulas can be used to estimate its gradient; or else
ensures the slow-varying condition (i.e., large overlaps
use gradient-free methods such as Nelder-Mead or
between intermediary states) in their adiabatic preparation
Powell’s method [120].
is independent of the Hamiltonian eigenbasis. Hence the
proof can be directly expressed in the eigenbasis of quan- The approaches for free-energy evaluation are especially
tum Hamiltonians without any other change. Moreover, useful for us since they can directly be used for the merit-
the tools they use for the adaptive construction of these function evaluation subroutine of Theorem 2. We choose
schedules, namely nondestructive amplitude estimation for to use the approach of Chowdhury et al., that we restate in
binary search of the next inverse temperature β  (Theorems Appendix C for completeness. Note however that, depend-
9 and 10 of Ref. [97]), are also applicable to general- ing on the RL application, a different approach might be
ized Szegedy walk operators. Hence, by applying the same more suitable, as different approximation methods corre-
methods,
 we obtain
 an overall preparation complexity in spond to different effective models. Further investigations
O βM θ /δ .  should reveal which methods are more interesting in a RL
context.

3. Variational Gibbs state preparation


Recent advances in the field of quantum compu- D. Quantum algorithms for merit function and
tation have motivated the development of algorithmic gradient evaluation
approaches that are compatible with near- and midterm As mentioned above, the neural-network formulation
devices. One particularly promising approach for the of the RBM and DEM free energies (as DEBNs) allows
preparation of Gibbs states is based on a variational for efficient exact computation of the modeled merit func-
optimization method. The general idea is to train a tions and their gradients. As for DBMs and QBMs, it is
parametrized Ansatz |ψθ  = U(θ ) |0 such that it approx- easy to see from the formulation of their free energies
imates the state of Eq. (16), or equivalently, that its den- [Eqs. (12) and (14)] that these should be estimated via
sity matrix ρ = |ψθ  ψθ | approximates the Gibbs state sampling from their Gibbs distributions. More specifically,
e−βH /Tr[e−βH ]. Since this target Gibbs state is the unique these are the Gibbs distributions of the clamped Hamiltoni-
state that minimizes the free energy ans, where state and action units have been fixed to a given
configuration.
1 Regarding the gradients of these free energies how-
F = Tr[ρH ] + Tr[ρlog(ρ)] (18)
β ever, the argument becomes slightly more intricate in

010328-19
SOFIENE JERBI et al. PRX QUANTUM 2, 010328 (2021)

the case of QBMs. As derived in Appendix C, deriva- this state-action pair can be estimated to precision ε in
tives of the QBM free energy take the form ∂θ F(s, a) =
−Tr[s,a ∂θ e−HQBM ]/Tr[s,a e−HQBM ]. Due to the nondiago-   
 K + KL2 (L − 1) 1
nal terms in HQBM , we have that HQBM and ∂θ HQBM do not O TGibbs · min , 2 ,
ε ε
commute, making it impossible to use the identity ∂θ e−H =
−∂θ He−H that would apply for instance to DBM Hamilto-
nians. Nonetheless, due to our restriction to semitransverse where TGibbs is the complexity of preparing the Gibbs state
QBMs and by the use of Duhamel’s formula [13], we of the clamped Hamiltonian, K is the number of hidden
can show that the gradient of the QBM free energy takes units in the model and KL is the number of hidden units in
the form of expectation values Tr[(∂θ HQBM )ρs,a ] where each of the L layers of the model.
∂θ HQBM are simply (products of) Pauli σ z and σ x operators
appearing in HQBM .
Hence, merit-function and gradient evaluation boil Proof. We count the number
 of components
 of the gradi-
down to Gibbs state preparation again, but with fixed state ent to of the order O K + KL (L − 1) . Using one of the
2

and action units s, a. We formalize the subroutines used for subroutines in Theorem 1, we can prepare a purified Gibbs
these evaluations in the following theorems. state of the clamped model Hamiltonian at inverse temper-
ature β = 1 in time TGibbs . From there, we can take two
approaches.
Theorem 2 ((DBM and QBM merit-function evaluation)).
θ
Let HDBM−QBM be a Hamiltonian describing a deep Boltz- (a) Measuring the hidden units of the Gibbs state in
mann machine as specified by Eq. (11) or a semitrans- the computational basis gives us a vector h sam-
verse quantum Boltzmann machine as specified by Eq. pled from the distribution P(h|s, a). For the QBM,
(13) and (s, a) a given assignment of the state and action we also need to estimate (in a separate round of
θ
units. The approximate free energy FDBM−QBM (s, a) = measurements) the expected values of Pauli σkx for
θ
−M (s, a) associated to this state-action pair can be all hidden units k, which can be done by applying
estimated to precision ε in a Hadamard gate on each hidden unit before mea-
  suring them in the computational basis. By repeat-
 TGibbs · θ1 1 ing these measurements O{1/ε2 log[K + KL2 (L −
O + 2 ,
ε pmin ε 1)]/δ} times and averaging the resulting vector sam-
ples, we can estimate each component of the gra-
where TGibbs is the complexity of preparing dient to precision ε with probability 1 − δ using
 the Gibbs Chebyshev’s inequality and the powering lemma.
state
 of the clamped
 Hamiltonian,  θ  1 = i,k |wi,k |si +
|w |a + |w  | (+ | |) and pmin is the (b) We can use amplitude estimation to compute the
j ,k j ,k j k,k  k,k k k
cutoff used in the estimation of the entropy term of the free expected value of each hidden unit, Hadamard-
energy. transformed hidden unit and pair of hidden units.
But every single term has to be estimated individu-
ally, which amounts to a total number of calls to the
Proof. Using one of the subroutines in Theorem 1, we can Gibbs state preparation unitary in O{[K + KL2 (L −
prepare a purified Gibbs state of the clamped model Hamil- 1)]/ε log δ −1 } to get all estimates to precision ε
tonian at inverse temperature β = 1 in time TGibbs . We then with probability 1 − δ. Note that this approach only
use the subroutines of Theorems 6 and 7 (Appendix C) to applies to Gibbs state preparation subroutines that
evaluate the average energy and von  Neumann  entropy of create a coherent encoding of the Gibbs sate [see
 (θ1 /ε) and O
this state using O  1/(p 2 ε) calls to the Eq. (16) and associated footnote] since we need a
min
Gibbs state preparation unitary, respectively. Note that the coherent preparation of our quantum state to apply
success probability of these subroutines can be amplified to amplitude estimation on it [122]. It is hence com-
1 − δ using log δ −1 repetitions and taking the median of patible with the quantum simulated annealing and
the outcomes, according to the powering lemma [121].  variational approaches but not the Hamiltonian sim-
ulation method.
Theorem 3 (Gradient evaluation of a DBM and QBM 
θ
merit function). Let HDBM−QBM be a Hamiltonian describ- Note that the clamped model Hamiltonians now act on a
ing a deep Boltzmann machine as specified by Eq. subspace of size 2K for a model with K hidden units. This,
(11) or a semitransverse quantum Boltzmann machine as well as a bound H θ  on the model’s clamped energy
as specified by Eq. (13) and (s, a) a given assignment that is potentially smaller (since action units are fixed),
of the state and action units. The gradient of the free leads to different complexities TGibbs in Theorem 1 and
θ
energy ∇θ FDBM−QBM (s, a) = −∇θ M θ (s, a) associated to Table I. Moreover, while the preparation of the clamped

010328-20
QUANTUM ENHANCEMENTS FOR DEEP REINFORCEMENT... PRX QUANTUM 2, 010328 (2021)

Gibbs state ρs,a is straightforward for the Hamiltonian sim- Of course, any such implementation will still suffer from
ulation and quantum simulated annealing approaches, we noise and decoherence effects, but these can be (partially)
point out that one needs to train again a new Ansatz |ψθ  mitigated through error-mitigation techniques [125,126].
in the variational approach. This is because, in general, one Errors that remain will still affect the computation, but as
has no way of efficiently clamping state and action units in has been argued previously, in ML contexts, this may be
the Ansatz trained on the full Hamiltonian (postselection less detrimental [2,127]. Performance of such near-term
on measurements or amplitude amplification are possi- implementations remains an open area of research.
ble but resource consuming). We consider this to be an The presented framework for quantum energy-based
interesting question that could be explored in future work. RL does not limit itself to the methods and models
presented here. Following current advances in quantum
VI. CONCLUSION AND OUTLOOK
machine learning, we expect genuinely quantum func-
Recent years have seen a massive surge of interest tion approximators such as parametrized quantum circuits
in leveraging results from quantum-information theory to [14,128–130], quantum generative models [41], or quan-
enhance machine-learning methods and models. Nonethe- tum kernels [131–133] to present interesting features (e.g.,
less, deep reinforcement learning, as one of the emerging the ability to represent quantum correlations) that could
fields in machine learning, remains largely unaffected by provide advantages in complex RL environments. More-
this trend. This is in part due to an effort made by the over, our study of sampling algorithms has been consid-
machine-learning community to avoid bottlenecks result- ering the restrictive setting of isolated (or context-free)
ing from application-specific algorithms in favor of general sampling distributions. The RL setting however offers an
applicability. online structure that could be exploited by the sampling
In this paper, we consider deep energy-based models algorithms. For instance, inspired by existing quantum
specifically designed to cope with large state and action algorithms that speed up sampling from sequences of
spaces in complex RL environments. We demonstrate that slowly evolving Markov chains [115], one could adapt
these models can indeed outperform standard RL archi- the Markov chain-based approach of quantum simulated
tectures such as deep Q networks due to their capacity to annealing to exploit the property that the agent’s policy
represent complex correlations between state-action pairs. is slowly evolving as well during training. Additionally,
The enhanced representational power of deep energy- one could adapt other heuristic methods that acceler-
based models comes at the cost of sampling from an asso- ate approximate sampling such as multicanonical Markov
ciated probability distribution. Even approximate sampling chain Monte Carlo [134] to a reinforcement learning sce-
is a hard problem and therefore, gives rise to a computa- nario, as to encourage exploration and learning multimodal
tional bottleneck. In this paper, we address this bottleneck policies [135]. In sum, our framework opens up a new
by leveraging results from quantum approximate sampling avenue for connecting a large class of quantum models and
algorithms in order to enable various models for deep methods to be applied in the context of quantum RL.
energy-based RL.
Our main result is a hybrid deep energy-based RL ACKNOWLEDGMENTS
algorithm that enables quadratic quantum speed ups with
full-scale quantum computers and is applicable to vari- S.J., H.P.N., L.M.T., and H.J.B. acknowledge sup-
ous deep energy-based models. In the interest of deploying port from the Austrian Science Fund (FWF) through
near-term quantum devices, we further prove that the the projects DK-ALM:W1259-N27 and SFB BeyondC
same algorithm can be combined with deep Boltzmann F7102. S.J. also acknowledges the Austrian Academy of
machines and quantum Boltzmann machines to provide Sciences as a recipient of the DOC Fellowship. H.J.B.
heuristic quantum speed ups with near-term quantum com- is also supported by the Ministerium für Wissenschaft,
puters. Forschung, und Kunst Baden Württemberg (AZ:33-7533.-
In our work, we consider both methods, which require 30-10/41/1). This work is in part supported by the Dutch
full fault-tolerant devices (that are, according to recent Research Council (NWO/OCW), as part of the Quantum
investigations, still out of reach [123,124]), as well as vari- Software Consortium program (Project No. 024.003.037).
ational methods, which may be more compatible with near- The computational results presented here are achieved in
term quantum computers (see Table I). When considering part using the LEO HPC infrastructure of the University of
near-term restrictions, we must account for size limits, Innsbruck.
gate noise, decoherence, and architecture constraints. With
respect to size, as we detail in Appendix D, for a deep or APPENDIX A: RANDOM AND QUANTUM WALKS
quantum Boltzmann machine with n units, one would only
require 2n + 3 qubits to run our variational methods. This 1. Mixing-time random walks
means that interesting and useful algorithms could be run A discrete-time random walk on a graph is described
with the order of 100 qubits, so well in the NISQ-era range. by a Markov chain (MC) that is specified by a transition

010328-21
SOFIENE JERBI et al. PRX QUANTUM 2, 010328 (2021)

matrix P gathering the transition probabilities [P]i,j = Pij pair (x, x ) ∈ X , which is used to define P as follows:
from node i to node j of the graph. When the MC is ergodic 
(i.e., irreducible and aperiodic), it has a unique stationary Rxx K
xx if x =
 x
distribution π = Pπ over the nodes of the graph. π can be Pxx = . (A3)
1 − x =x Rxx Kxx if x = x
approximated by repeated applications P t π 0 of P to any
initial distribution π 0 , or equivalently, by applying t steps In practice, K is chosen such that the sets N (x) = {x :
of the random walk P to any initial distribution π 0 over the Kxx > 0} define a neighborhood structure on X and
nodes of the graph (see Fig. 11). The minimal number of maxx∈X N (x)  |X |. A possible choice for N (x) and
applications of P needed to obtain an   -close approxima- K(x, ·) is, e.g., a Hamming ball around x and the uniform
tion of π (in lp norm) starting from any π 0 is called the distribution over it, respectively.
mixing time of P: If Kxx > 0 ∀x ∈ X and the graph defined by N is con-
nected, then MH [137] ensures that P is reversible and has

  = min{t : ∀π 0 , P π 0 − π p ≤  }.
tmix t
(A1) as stationary distribution:

When the MC is moreover reversible, its mixing time can eβf (x)
be related to its spectral properties. More especially, we πβ (x) =  βf (x ) . (A4)
a e
have [103]
! " # $ 3. Gibbs sampling (GS)
1−δ 1 1 1
≤ tmix
 ≤ log  ∗ , (A2) Gibbs sampling [138] is a particular instance of the
2 log(2  ) δ δ π
Metropolis-Hastings algorithm where the neighborhood
where δ = 1 − maxi>0 |λi | ∈ (0, 1] (i.e., the difference N (x) is a Hamming ball of radius 1 around x (including
between the largest and second largest eigenvalues in abso- x itself) and the candidate transition matrix K is given by
lute value of the matrix P) is the spectral gap of P and ' 
π ∗ = mini {πi }. We say that the mixing time of a reversible eβf (x )
1
log |X | eβf (x) +eβf (x )
if x ∈ N (x)
Kxx = . (A5)
MC P is of the order Õ(δ −1 ) [136]. 0 if x ∈
/ N (x)

2. Metropolis-hastings (MH) This K has the special property that its corresponding
The MH algorithm is a general Markov chain Monte acceptance ratio Rxx = 1 for all x ∈ N (x). In particular,
Carlo method that allows one to sample from a prob- the conditions for reversibility of P are always met.
ability distribution over a configuration space X given
only access to its corresponding density function f (x) 4. Szegedy walks
without having to evaluate its value for all x. When Szegedy-type quantum walks [112,113] are a quantum
dealing with the energy-based models listed in Sec. V, analog of classical random walks. They generalize an MC
X can either be the space A of all actions and f (x) = P to a unitary W(P) called the Szegedy walk operator.
−F θ (s, a) for a fixed state s, or X is the space A × Here we describe how this unitary can be constructed and
H of all configurations of action and hidden units and explain its interesting properties.
f (x) = −E θ (s, a, h). MH constructs a reversible MC P Let H be the Hilbert space that shares the same basis
having the desired distribution as stationary distribution vectors as the MC P = [Pij ]. So-called quantum diffusion
by transforming a candidate transition matrix K into P. operators UP and VP act on H ⊗ H as
This construction
% starts
 by introducing an acceptance
& ratio
Rxx = min 1, exp βf (x ) Kx x /exp[βf (x)]Kxx for each '  
UP |i |0 = |i j Pij |j 
 ,
VP |0 |i = j Pij∗ |j  |i

where P ∗ is the time-reversed MC [139] associated to P.


Its elements are given by Pij∗ = Pji πi /πj , where π = (πi )
is the stationary distribution of P. When P ∗ = P, we say
that the MC is reversible. Although Szegedy walks can
be defined if this property is not fulfilled, one would
require additional access to P ∗ to be able to implement VP .
FIG. 11. Mixing-time random walks. Note that the graphs are Because this is not usually the case (the expression of P ∗
generally nonsymmetric (that is, the connections are directed), involves the stationary distribution of P), the reversibility
but are represented symmetric for simplicity. of P is then crucial for the implementation of VP .

010328-22
QUANTUM ENHANCEMENTS FOR DEEP REINFORCEMENT... PRX QUANTUM 2, 010328 (2021)

With the quantum diffusion operators UP and VP at hand, distribution:


we can construct the Szegedy walk operator W(P), which
θ
is the reflection over the spaces A and B:
θ e−E (x)
P (x) =  . (B1)
−E θ (x )
' x e
A = span{UP |i |0 : i = 1, . . . , N }
B = span{VP |0 |i : i = 1, . . . , N } This energy function takes the form of an Ising model spin
Hamiltonian, which accounts for pair interactions wik and
W(P) = ref(B) ref(A), individual external magnetic fields bi {[(wik )i,k , (bi )i ] =
θ ∈ Rd } acting between and over spins xi and xk . Different
where configurations of these spins correspond to different binary
vectors x ∈ X . Commonly used types of BMs are the
' so-called restricted Boltzmann machines [79,141], char-

ref(A) = UP [1N ⊗ ref(0)] UP
† (A6) acterized by their bipartite structure (see Fig. 3), which
ref(B) = VP [ref(0) ⊗ 1N ] VP restricts the interactions wik in the Ising model Hamiltonian
and energy function. RBMs divide the spins {xi }i between
and two subsets v = {vi }i and h = {hk }k , respectively, called
visible and hidden units. The visible units represent the
ref(0) = 2|00| − 1N . data that we want to encode in the model, while the hidden
(or latent) units allow more degrees of freedom to be had
We refer to A + B as the busy subspace and to its orthog-
in the specification of the energy function and allow more
onal complement A⊥ ∩ B⊥ as the idle subspace. W(P) acts
complex distributions to be represented over the visible
as the identity on the idle subspace, but, on the busy
units. The RBM model effectively encodes the probability
subspace, its unique eigenvector with eigenvalue 1 (or,
distribution:
equivalently, eigenphase 0) is
 θ (v,h) θ
√ θ e−E e−F (v)
|π  |0 = πi |i |0 . P (v) =  h
 = , (B2)
v ,h e−E θ (v ,h ) v e
−F θ (v )
i

where
This follows from Refs. [112,113].   
As we are interested in the spectral gap δ of P classically, −E θ (v, h) = wik vi hk + bi vi + bk hk (B3)
we are interested in the phase gap  = 2 mini>0 |θi |, i.e., i,k i k
the smallest nonzero eigenphase of W(P), in the quantum
case. Given that the eigenvalues of P and the eigenphases and
of W(P) are related by |λi | = cos(θi ) ∀i, it can be shown  
[112,113] that 
θ −E θ (v,h )
√ F (v) = −log e
 ≥ 2δ, h
 θ

i.e., the phase gap of W(P) is quadratically larger than the e−E (v,h)
spectral gap of P. This is a crucial property for Szegedy = −log ∀h
P(h|v)
quantum walks as a “quantum search” [58,111] for |π  |0
on the busy subspace of W(P) is governed by its phase gap = E θ (v, h) + log [P(h|v)] ∀h
as Õ(−1 ). This is quadratically faster than sampling from 
π classically. = P(h|v){E θ (v, h) + log [P(h|v)]}
h

APPENDIX B: BOLTZMANN MACHINES AND = E θ [v, hP(h|v) ] − H (h|v)


ENERGY-BASED MODELS
1. Restricted Boltzmann machines is the so-called free energy of the RBM (the average
energy after tracing out the hidden units).
Boltzmann machines [140] are energy-based genera-
tive models, meaning that they can learn probability dis-
tributions p̂(x) over a set X ⊂ {0, 1}n , by encoding a 2. Derivation of the free-energy neural network
parametrized distribution p θ (x). This probability distribu- We derive below the expression of a RBM free energy as
tion is specified by an energy function E θ (x) defined over a feedforward neural network. This derivation is inspired
the entire set X and normalized by a Boltzmann-Gibbs by Ref. [75].

010328-23
SOFIENE JERBI et al. PRX QUANTUM 2, 010328 (2021)

Starting from the expression of a RBM probability distribution:


⎛ ⎞
1    
P θ (v) = θ exp ⎝ wik vi hk + bi vi + bk hk ⎠
Z h i,k i k
   
1  ,   
= θ exp bi vi exp wik vi hk + bk hk # eax1 +bx2 = (1 + ea )(1 + eb )
Z i k hk ∈{0,1} i x∈{0,1}2
    
1  ,  
= θ exp bi vi 1 + exp wik vi + bk # eax1 +bx2 = (1 + ea )(1 + eb )
Z i k i x∈{0,1}2
  '   -  
1    , 
= θ exp bi vi exp log 1 + exp wik vi + bk # ak = exp log(ak )
Z i k i k k
  
1   
= θ exp bi vi + softplus wik vi + bk # softplus(x) = log[1 + exp(x)]
Z i k i
1  
= θ
exp −F θ (v)
Z

we end up with the σ z and σ z σ z operators from the σ x operators in the


 
   Hamiltonian of Eq. (13) allows the latter to be expressed as
θ
−F (v) = bi vi + softplus wik vi + bk , ⎛ z,zz ⎞
i k i H1 0 ··· 0
(B4) ⎜ 0 H2z,zz · · · 0 ⎟
⎜ ⎟
HQBM = ⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
which takes the form of the feedforward neural network
depicted in Fig. 4. See Fig. 12 for a comparison between 0 0 · · · H|z,zz
V |×|H|
⎛ x ⎞
the softplus function and the commonly used Rectified H1,1 · · · H1,| x
H|
Linear Unit (ReLU) activation function for NNs. ⎜ .. ⎟
+ Iv ⊗ ⎝ ... ..
. . ⎠
3. Eigendecomposition of HQBM H|xH|,1 ··· H|xH|,|H|
We write the QBM Hamiltonian in the computational ⎛ ⎞
basis (v, h), where we divide the Hilbert space on which M1 0 ··· 0
it acts into visible and hidden units V ⊗ H. Separating ⎜0 M2 ··· 0 ⎟ 
=⎜
⎝ ... .. .. ⎟ = |v v| ⊗ Mv ,
. ⎠
..
. . v
0 0 ··· M| V |
% &
where M1 , . . . , M|V%| are nondiagonal matrices acting&
on the subspaces span{v1 } ⊗ H, . . . , span{v|V | } ⊗ H ,
respectively, for vi computational basis states of visible
units. This block diagonal form implies that the eigen-
vectors of HQBM are block vectors of the form |v, hv ,
where |v is a computational basis state and |hv  can be
a superposition of hidden unit configurations.

4. Derivation of the PS error function


FIG. 12. Comparison between the commonly used ReLU We show in Sec. III A how to derive an error function
activation function max(0, x) and the softplus function x + from the Q-learning update rule. Due to its shared origin,
log 1 + e−x = log(1 + ex ) appearing in the expression of the deriving an error function from the SARSA update rule can
free energy of a RBM. be done similarly. In contrast, h values are not linked to the

010328-24
QUANTUM ENHANCEMENTS FOR DEEP REINFORCEMENT... PRX QUANTUM 2, 010328 (2021)

Bellmann equation and hence do not represent estimates of inverse temperature β  ∈ [0, 1]. Assume that we can gener-
the (bounded) return. Instead, they are allowed to accumu- ate the state |0  corresponding to the coherent encoding
late rewards indefinitely (and hence diverge). Diverging h of the prior, and assume that for every inverse temperature
values can still lead to converging policies after normal- β  we have a Markov chain Pβ  with stationary distribu-
ization but they constitute a problem for the derivation of 
tion β and spectral gap lower bounded by δ. Then,

an error function, as this error can never be minimized for any ε > 0, η > 0, there is a quantum algorithm  that,

to 0, hence making neural networks harder to train. One with probability at least 1 − η, produces a state 
01 so
  
mechanism in PS that allows us to avoid this divergence is that  01 − |1  ≤ ε for |1  the coherent encoding
damping, and we show here how it can be translated into a of the posterior distribution 1 (x) = 0 (x)[e−L(x) /Z(1)].
regularization term in the PS error function. The algorithm uses
Consider the original update rule of PS [45]:
1 2
hij ← hij − γPS (hij − 1) + gij λ, (B5) 
O E0 [L(x)]/δ

where γPS ∈ [0, 1] is an internal damping (or forgetfulness) total steps of the quantum-walk operators corresponding
parameter allowing the agent to adapt to changing envi- to the Markov chains Mβ  .
ronments, λ is the last positive reward experienced by the
agent, and the edge glow gij is a varying rescaling factor
accounting for delayed rewards. In the case of a DEBN, we use Theorem 4 with
Let us take this update rule with damping γPS = 0 and x = a, L(x) = −βM−θ (s, a) and 0 (x) = |A1 | . M−θ (s, a) is
edge glow gij = 1: obtained from M%θ (s, a) by the transformation
& M−θ (s, a) =
M θ (s, a) − max 0, maxa M θ (s, a) , such that −2 maxa
hij ← hij + λ. (B6) |M θ (s, a)| ≤ M−θ (s, a) ≤ 0, i.e., 0 ≤ L(x) ≤ 2β maxa
M θ . Hence, sampling from the distribution πβ
This setup cancels the damping mechanism and supposes θ βM θ (s,a)
that edge-glow values gij = 1 are absorbed in the reward (a|s) = eβM − (s,a)
% /Z−,β (s) &= e  /Zβ (s)  has complex-
 
ityin O E0 [L(x)]/δ = O (1/|A|) a L(a)/δ ≤
λ. This update rule gives the following error function: 
O βM θ /δ .
EPS [s(t) , a(t) ] = r(t+1) , (B7)
Note: while the value M θ  = maxa |M θ (s, a)| is not
needed to implement the quantum-walk operator of Pβ
which is never 0 as long as the given transition is rewarded associated to M− (s, a) (see Appendix E), it is actually
(which is the case in a static environment) and leads to needed in a subroutine behind the algorithm of Theorem
diverging h values. 4 (see Algorithm 1 in Ref. [97]).
To avoid this divergence, we take into account damping A similar approach can be taken for DBMs, where now
by effectively introducing a regularization term: x = (a, h), L(x) = βE− (s, a, h) and 0 (x) = 1/|A|2K .

O
EPS [s(t) , a(t) ] = r(t+1) − γPS hij (B8) The resulting

complexity
 is now in
(1/|A|2K ) a,h L(a, h)/δ ≤ O  βM θ /δ (for a
such that the error is now 0 when hij = r(t+1) /γPS , hence different spectral gap δ).
effectively bounding the h values. Note that EPS [s(t) , a(t) ] =
h(t+1)
ij − h(t)
ij , so this new error corresponds to the update Theorem 5 (Gibbs state preparation with sparse access
rule: [99, Lemma 44]). Let H ∈ CN ×N be a d-sparse Hamilto-
nian. Suppose that I  H , we are given C ∈ R+ such that
hij ← (1 − γPS )hij + λ, (B9) H  ≤ 2C, and we know a lower bound z ≤ Z = tr[e−H ].
If  ∈ (0, 1/3), then we can prepare a purified Gibbs state
which is the original PS update rule [Eq. (B5), with the |γ AB such that
edge glow fixed to 1].
  e−H
APPENDIX C: DEFERRED THEOREMS AND TrB |γ  γ |AB − ≤
Z
DERIVATIONS FOR QUANTUM SUBROUTINES
Theorem 4 (Harrow and Wei [97, Theorem 10]). Assume using
that we are given a prior distribution 0 (x) and a like-   
lihood function L(x) ≥ 0, so that N C 1
 we can parametrize O Cd log log
the partition function Z(β  ) = x 0 (x)e−β L(x) at each z  

010328-25
SOFIENE JERBI et al. PRX QUANTUM 2, 010328 (2021)

For the DEBN, we have N = 2^A, while N = 2^{AK} for the DBM and QBM, where K is the number of hidden units of these models. We enforce the condition I ⪯ H by choosing H = βH_model − (βλ̃_min − 3/2)I, where λ̃_min is either bounded by the weights of the model or estimated via a minimum-finding subroutine. This choice also gives us natural bounds z = e^{−2} and C = β‖H_model‖. Since the DEBN and DBM Hamiltonians are diagonal, their sparsity is d = 1. As for the QBM Hamiltonian, the σ^x terms each add at most one nonzero entry to each row and column, which leads to a sparsity d = 1 + K. Hence the complexity of preparing the purified Gibbs states of the DEBN and DBM and QBM is in Õ(√(2^A) β‖H_model‖) and Õ(√(2^{AK}) β‖H_model‖), respectively.

Theorem 6 (Average energy estimation [100, Theorem 2]). Let H be a Hamiltonian that can be represented as a linear combination of unitaries H = Σ_{k=1}^K α_k U_k with ‖α‖_1 = Σ_k |α_k| for K ∈ O(log N), and let U_ψ be a unitary that prepares a state |ψ⟩. The number of applications of the unitaries U_P, U_S and the unitary U_ψ to estimate the average energy ⟨ψ|H|ψ⟩ within error ε with probability at least 2/3 is in O(‖α‖_1/ε).

Theorem 7 (von Neumann entropy estimation [100, Theorem 1]). Given access to an oracle U_ρ that prepares a purification Σ_j √(p_j) |j⟩|ψ_j⟩ of a density matrix ρ = Σ_j p_j |ψ_j⟩⟨ψ_j| where p_j ≥ p_min, the von Neumann entropy of the state ρ can be estimated within error ε with probability greater than 2/3 with O(1/(p_min ε²)) preparations of the purified density operator.

1. Derivation of the gradient expression for DBMs

This derivation is relatively easy because we can use the property ∂_θ e^{−H} = −∂_θ H e^{−H} with a diagonal Hamiltonian H. Define Z_{s,a} = Σ_h e^{−H(s,a,h)}, then

∂F/∂θ (s, a) = −(∂/∂θ) log Z_{s,a}
             = −(1/Z_{s,a}) ∂Z_{s,a}/∂θ
             = −(1/Z_{s,a}) Σ_h ∂e^{−H(s,a,h)}/∂θ
             = (1/Z_{s,a}) Σ_h [∂H(s, a, h)/∂θ] e^{−H(s,a,h)}
             = Σ_h P(h|s, a) ∂H(s, a, h)/∂θ
             = ⟨∂_θ H⟩_{P(h|s,a)},

where ∂_θ H is the σ_l^z or σ_l^z σ_{l′}^z operator associated to the parameter θ (or equivalently the random variables associated to units l and l′ of the DBM), depending on whether the latter is a bias or a weight, respectively. P(h|s, a) is the Gibbs distribution obtained by fixing the operators σ^z associated to state and action units to the configuration (s, a).
that prepares a state |ψ. The number of applications of
the unitaries UP , US and the unitary Uψ to estimate the Since ∂θ e−H = −∂θ He−H , we now rely on Duhamel’s
average energy ψ| H |ψ within error  with probability formula to express
at least 2/3 is in O (α1 /).
3 1
−H
Theorem 7 (von Neumann entropy estimation [100, ∂θ e =− e−τ H ∂θ He(τ −1)H dτ ,
0
Theorem 1]). Given
 √access to an oracle Uρ that prepares 3 1
a purification pj |j  ψj of a density matrix ρ = ⇒ s,a ∂θ e −H
=− s,a e−τ H ∂θ He(τ −1)H dτ .
    j
 
j pj ψj ψj where pj ≥ pmin , the von Neumann entropy 0
of the state ρ can be estimated within error  with proba-
bility greater than 2/3 with O [1/(p 2 )] preparations of We then use the circular permutation property of the trace
min
the purified density operator. to get
3
  1  
1. Derivation of the gradient expression for DBMs Tr s,a ∂θ e−H = − Tr ∂θ He(τ −1)H s,a e−τ H dτ .
This derivation is relatively easy because we can use the 0

property ∂θ e−H = −∂θ He−H with a diagonal Hamiltonian 


Using the eigendecompostion of HQBM = λv,hv
H . Define Zs,a = h e−H (s,a,h) , then v,hv
|v, hv  v, hv | (Appendix B), we find
∂F ∂    
(s, a) = − log Zs,a
∂θ ∂θ e(τ −1)H v e−τ H = e(τ −1)λv ,hv  v , hv  v , hv  
1 ∂Zs,a v ,hv 
=−    
Zs,a ∂θ v e−τ λv ,hv  v , hv  v , hv  
1  ∂e−H (s,a,h) v ,hv 
=− 
Zs,a h ∂θ = e−λv,hv |v, hv  v, hv |
1  ∂H (s, a, h) −H (s,a,h) hv
= e
Zs,a h ∂θ = v e−H v

and hence

Tr[Λ_{s,a} ∂_θ e^{−H}] = −∫_0^1 Tr[∂_θ H Λ_{s,a} e^{−H} Λ_{s,a}] dτ
                       = −Tr[∂_θ H Λ_{s,a} e^{−H} Λ_{s,a}]

and

∂F/∂θ = Tr[∂_θ H Λ_{s,a} e^{−H} Λ_{s,a}] / Tr[Λ_{s,a} e^{−H}]
      = Tr[(∂_θ H) ρ_{s,a}]
      = ⟨∂_θ H⟩_{ρ_{s,a}},

where now ∂_θ H is the σ_k^x, σ_l^z or σ_l^z σ_{l′}^z operator associated to the parameter θ.
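The final identity ∂F/∂θ = Tr[(∂_θ H) ρ_{s,a}] can be verified numerically by exact diagonalization on a toy two-qubit model. The sketch below is written under our own assumptions (one clamped visible unit, one hidden unit carrying the transverse field, so that the clamping projector commutes with H, as in the derivation above); the helper names expm_herm, Lam, etc. are ours and this is not code from Ref. [94].

import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])

def kron(*ops):
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

def expm_herm(A):
    # Matrix exponential of a Hermitian matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.conj().T

# Qubit 0: visible (clamped), qubit 1: hidden (carries the transverse field).
b_v, b_h, w_vh, gamma = 0.3, -0.2, 0.5, 0.7
H = -(b_v * kron(Z, I2) + b_h * kron(I2, Z)
      + w_vh * kron(Z, Z) + gamma * kron(I2, X))

# Projector clamping the visible qubit to sigma^z = +1; it commutes with H here
# because the transverse field acts on the hidden unit only.
Lam = kron(np.diag([1.0, 0.0]), I2)

expH = expm_herm(-H)
Zsa = np.trace(Lam @ expH)
rho_sa = Lam @ expH @ Lam / Zsa

# Gradient of F = -log Z_{s,a} with respect to the weight w_vh.
dH = -kron(Z, Z)                       # dH / dw_vh
grad = np.real(np.trace(dH @ rho_sa))

# Finite-difference check of the same derivative.
def F(w_):
    Hw = -(b_v * kron(Z, I2) + b_h * kron(I2, Z)
           + w_ * kron(Z, Z) + gamma * kron(I2, X))
    return -np.log(np.real(np.trace(Lam @ expm_herm(-Hw))))

eps = 1e-5
assert abs(grad - (F(w_vh + eps) - F(w_vh - eps)) / (2 * eps)) < 1e-6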
APPENDIX D: EVALUATION OF THE VARIATIONAL APPROACH REQUIREMENTS

The variational approaches for Gibbs state preparation presented in Sec. V 3 involve three main components: the preparation of the parametrized Ansatz state |ψ_θ⟩, the evaluation of its average energy Tr[|ψ_θ⟩⟨ψ_θ| H], and the evaluation of its von Neumann entropy −Tr[|ψ_θ⟩⟨ψ_θ| log(|ψ_θ⟩⟨ψ_θ|)]. We consider here the simplest scenario imaginable in order to find the minimal number of qubits required to run these methods, using current state-of-the-art techniques. For that, we consider our model Hamiltonian to be that of a DBM and QBM whose connectivity follows the natural connectivity of the quantum hardware at hand. This Hamiltonian acts on n qubits (the number of the units in the DBM and QBM) and has at most d = n² + 2n terms. We can take our Ansatz |ψ_θ⟩ to be a quantum state that belongs to the hardware-efficient family of n qubits, as in Ref. [102], or a QAOA-inspired Ansatz, as in Ref. [101] (these are seemingly the easiest quantum states to prepare, as they do not require ancillas). For the average energy estimation subroutine, we can either take a phase estimation approach as in Fig. 3 of Ref. [100], which requires log(d) + 1 ancillas and Õ(‖H‖/ε) iterations, or an alternative approach that only requires measuring the state |ψ_θ⟩ in the Z^⊗n basis (and X^⊗n basis for a QBM) without any ancillas but Õ((‖H‖/ε)²) iterations. As for the entropy estimation subroutine, we again take the phase estimation approach of Ref. [100] (Fig. 1) using 1 controlling ancilla qubit and n + 2 ancillas for Hamiltonian simulation of the density operator ρ = |ψ_θ⟩⟨ψ_θ| [using the state-of-the-art techniques of Ref. [110] (last row of Table I)]. At most, these subroutines require 2n + 3 logical qubits to be executed. A study of the number of quantum operations needed to implement these subroutines on quantum hardware and the noise resilience of these methods would be an interesting topic of further investigation, possibly inspired by the methods of Refs. [123,124].
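To make the measurement-based alternative concrete, the following NumPy sketch estimates Tr[|ψ⟩⟨ψ| H] for a toy Hamiltonian built from σ^z, σ^z σ^z, and σ^x terms by sampling a simulated state vector in the Z^⊗n and X^⊗n bases; the quadratically worse iteration count quoted above reflects the usual shot-noise scaling of this kind of estimator. This is only our illustration: the state, Hamiltonian coefficients, and all names are chosen by us.

import numpy as np

rng = np.random.default_rng(1)
n, shots = 3, 200_000

# Toy "Ansatz" state: a random normalized state vector on n qubits.
psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi /= np.linalg.norm(psi)

def sample_bits(state, shots):
    # Sample computational-basis outcomes and return them as +/-1 values.
    probs = np.abs(state) ** 2
    idx = rng.choice(2**n, size=shots, p=probs)
    bits = (idx[:, None] >> np.arange(n)[::-1]) & 1
    return 1 - 2 * bits                       # bit 0 -> +1, bit 1 -> -1

z_vals = sample_bits(psi, shots)              # Z-basis samples

# X-basis samples: rotate with Hadamards on every qubit, then measure in Z.
H1 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
Hn = H1
for _ in range(n - 1):
    Hn = np.kron(Hn, H1)
x_vals = sample_bits(Hn @ psi, shots)

# Toy Hamiltonian: H = 0.5 Z_0 - 0.3 Z_1 Z_2 + 0.8 X_0.
energy = (0.5 * z_vals[:, 0].mean()
          - 0.3 * (z_vals[:, 1] * z_vals[:, 2]).mean()
          + 0.8 * x_vals[:, 0].mean())
print(energy)   # approaches <psi|H|psi> as the number of shots grows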

APPENDIX E: COHERENT ACCESS TO THE TRANSITION MATRIX

In this appendix, we detail the construction of the diffusion operator U_P required to implement quantum walks (see Appendix A). We want to implement the operation |i⟩_1 |0⟩_2 |0⟩_aux → |i⟩_1 Σ_{j∈N(i)} √(P_ij) |j⟩_2 |0⟩_aux coherently for all actions i and for the transition matrix P and neighborhood structure N given by Metropolis-Hastings or Gibbs sampling (Appendix A). In order to achieve this, we assume access to the description of the classical circuits that compute the operations |i⟩_1 |j⟩_2 |0⟩ → |i⟩_1 |j⟩_2 |P_ij⟩ and |i⟩ |0⟩ → |i⟩ Σ_{j∈N(i)} |j⟩, having complexity C(P) and O[poly(M)], respectively. Using either Fredkin or Toffoli gates, it is possible to design quantum circuits that implement coherently the operation |i⟩ |0⟩^⊗M |0⟩_aux → |i⟩ |P_i1⟩ · · · |P_iM⟩ |0⟩_aux with comparable complexities by reversibilization (Sec. 3.2.5 of Ref. [142]), where M = max_{a∈A} |N(a)| and the j ∈ N(i) have been renamed 1, . . . , M for each i for simplicity. Using this circuit and a procedure called coherent controlization [143,144], it is possible to implement the coherent access to the columns of the transition matrix P using O[M × C(P) + poly(M)] operations, same as classically. This implementation goes as follows:

|i⟩_1 |0⟩_aux^⊗M |0⟩_aux^⊗(M−2) |0⟩_2^⊗log(M)
→ |i⟩_1 |P_i1⟩ · · · |P_iM⟩ |0⟩_aux^⊗(M−2) |0⟩_2^⊗log(M)      # cost M × C(P) + O[poly(M)]
→ |i⟩_1 |P_i1⟩ · · · |P_iM⟩ |√(P_i1 + · · · + P_i,M/2)⟩ · · · |√(P_i,M−1 + P_iM)⟩ |0⟩_2^⊗log(M)      # cost M − 2 additions
→ |i⟩_1 |P_i1⟩ · · · |P_iM⟩ |√(P_i1 + · · · + P_i,M/2)⟩ · · · |√(P_i,M−1 + P_iM)⟩ Σ_{j=1}^M √(P_ij) |j⟩_2      # cost of CC: O[poly(M)]
→ |i⟩_1 |0⟩_aux^⊗M |0⟩_aux^⊗(M−2) Σ_{j=1}^M √(P_ij) |j⟩_2      # cost M × C(P) + O[poly(M)].
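The amplitude-loading step above follows a binary-tree pattern: the partial sums computed in the second step determine how the amplitude is split between the two halves of each index range. The sketch below is our own classical simulation of that bookkeeping (the names load_row and p_row are ours), not the quantum circuit itself; it reproduces the target amplitudes √(P_ij) from one row of P.

import numpy as np

def load_row(p_row):
    """Classically simulate the tree-based loading of amplitudes sqrt(P_ij)."""
    M = len(p_row)
    state = np.zeros(M)

    def split(lo, hi, amp):
        if hi - lo == 1:
            state[lo] = amp
            return
        mid = (lo + hi) // 2
        total = p_row[lo:hi].sum()
        if total == 0.0:
            return
        left = p_row[lo:mid].sum()
        # Partial sums fix the split of the amplitude between the two halves.
        split(lo, mid, amp * np.sqrt(left / total))
        split(mid, hi, amp * np.sqrt(1.0 - left / total))

    split(0, M, 1.0)
    return state

p_row = np.array([0.1, 0.4, 0.25, 0.25])   # one row of a toy transition matrix P
assert np.allclose(load_row(p_row), np.sqrt(p_row))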

APPENDIX F: DETAILS OF THE NUMERICAL SIMULATIONS

For reference, the Python code used for all simulations is available on GitHub [94].
For the simulations of GridWorld, we chose ADAM as an optimizer with a learning rate of α = 0.0001. The training batches of the main network contain 100 samples and the replay memory has a capacity of 5000 interactions, updated in a first-in-first-out manner. The target network is updated every 100 calls to the replay memory while the experience replay memory is called every 100 timesteps. We fix the glow parameters of the PS update rule to a value of 0.99. The agent's policy is a softmax policy and for the β parameter we choose a hyperbolic tangent schedule that starts with a value of 0.001 and ends with a value of 0.664 (hyperbolic tangent applied to a linear schedule that starts with a value of 0.001 and ends with a value of 0.8). The neural networks have 64 units per layer.
For the simulations of CartPole, in order to have a fair comparison of the performance of neural networks with different depths, we set their number of units per hidden layer such that all neural networks have approximately the same total number of weights and biases. More precisely, we set the number of units per layer to be 10, 19, 73 for neural networks with, respectively, 5, 2, and 1 hidden layer(s), leading to a total number of weights and biases of 525, 528, 526, respectively.
For neural networks with five hidden layers, we choose ADAM as an optimizer with a learning rate of α = 0.001. The training batches of the main network contain 100 samples and the replay memory has a capacity of 2000 interactions, updated in a first-in-first-out manner. The target network is updated every 10 000 calls to the replay memory while the experience replay memory is called every five timesteps. For neural networks with one and two hidden layer(s), we choose ADAM as an optimizer with a learning rate of α = 0.01. The training batches of the main network contain 200 samples and the replay memory has a capacity of 2000 interactions, updated in a first-in-first-out manner. The target network is updated every 5000 calls to the replay memory while the experience replay memory is called every five timesteps. For all simulations, we fix the glow parameters of the PS update rule to a value of 0.99. The agent's policy is a softmax policy and for the β parameter we choose a hyperbolic tangent schedule that starts with a value of 0.01 and ends with a value of 1 (hyperbolic tangent applied to a linear schedule that starts with a value of 0.01 and ends with a value of 10).
For the simulations of the policy-sampling experiment, in order to have a fair comparison of performance between DEBNs and DQNs, we choose to keep their total number of weights fixed to 10440, i.e., such that the minimal number of hidden units per layer of the DQNs are 10 in all instances of the experiment. For a given number of actions, we adapt the size of the two hidden layers of the networks such that both layers have the same number of hidden units. We choose ADAM as an optimizer with a learning rate of α = 0.001. The training batches of the neural network contain ten samples. The agent's policy is a softmax policy with a β parameter set to 1.
For the simulations of the reward-discounting experiment, in order to have a fair comparison of performance between DEBNs and DQNs, we choose to keep their total number of weights fixed to 2060, i.e., such that the minimal number of hidden units per layer of the DQNs are ten in all instances of the experiment. For a given number of actions, we adapt the size of the two hidden layers of the networks such that both layers have the same number of hidden units. We chose ADAM as an optimizer with a learning rate of α = 0.01. The training batches of the main network contain ten samples and the replay memory has a capacity of 2000 interactions. The target network is updated every 100 timesteps while the experience replay memory is updated (along with the main network) every ten timesteps (in a first-in-first-out manner).
For the simulations of the circular GridWorld experiment, in order to have a fair comparison of performance between DEBNs and DQNs, we choose to keep their total number of weights fixed to 2660, i.e., such that the minimal number of hidden units per layer of the DQNs are ten in all instances of the experiment. For a given number of actions, we adapt the size of the two hidden layers of the networks such that both layers have the same number of hidden units. We choose ADAM as an optimizer with a learning rate of α = 0.01. The training batches of the main network contain 100 samples and the replay memory has a capacity of 5000 interactions. The target network is updated every 500 timesteps while the experience replay memory is updated (along with the main network) every ten timesteps (in a first-in-first-out manner). The agent's policy is a softmax policy and for the β parameter we choose a linear schedule that starts with a value of 1 and ends with a value of 100.
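The β schedules quoted above are literally a hyperbolic tangent applied to a linear ramp, and actions are drawn from a softmax over the network outputs. The following minimal sketch uses our own helper names (tanh_beta_schedule, softmax_policy) and is not taken from the repository [94]:

import numpy as np

def tanh_beta_schedule(start, end, num_steps):
    # Hyperbolic tangent applied to a linear schedule, as described above.
    return np.tanh(np.linspace(start, end, num_steps))

def softmax_policy(values, beta, rng=np.random.default_rng()):
    # Sample an action from a softmax (Boltzmann) policy at inverse temperature beta.
    logits = beta * values
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(values), p=probs)

betas = tanh_beta_schedule(0.001, 0.8, 500)
print(betas[0], betas[-1])   # ~0.001 and ~0.664, the GridWorld schedule endpoints
action = softmax_policy(np.array([1.0, 2.0, 0.5]), beta=betas[-1])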

[1] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Quantum machine learning, Nature 549, 195 (2017).
[2] V. Dunjko and H. J. Briegel, Machine learning & artificial intelligence in the quantum domain: A review of recent progress, Rep. Prog. Phys. 81, 074001 (2018).
[3] C. Ciliberto, M. Herbster, A. D. Ialongo, M. Pontil, A. Rocchetto, S. Severini, and L. Wossnig, Quantum machine learning: A classical perspective, Proc. R. Soc. A 474, 20170551 (2018).
[4] M. Schuld, I. Sinayskiy, and F. Petruccione, An introduction to quantum machine learning, Contemp. Phys. 56, 172 (2015).
[5] J. Adcock, E. Allen, M. Day, S. Frick, J. Hinchliff, M. Johnson, S. Morley-Short, S. Pallister, A. Price, and S. Stanisic, Advances in quantum machine learning, arXiv:1512.02900 (2015).
[6] M. Schuld, Supervised Learning with Quantum Computers (Springer, Cham, 2018).
[7] H. Neven, V. S. Denchev, G. Rose, and W. G. Macready, Training a binary classifier with the quantum adiabatic algorithm, arXiv:0811.0416 (2008).
[8] S. Lloyd, S. Garnerone, and P. Zanardi, Quantum algorithms for topological and geometric analysis of data, Nat. Commun. 7, 1 (2016).
[9] P. Rebentrost, M. Mohseni, and S. Lloyd, Quantum Support Vector Machine for Big Data Classification, Phys. Rev. Lett. 113, 130503 (2014).
[10] S. Lloyd, M. Mohseni, and P. Rebentrost, Quantum principal component analysis, Nat. Phys. 10, 631 (2014).
[11] N. Wiebe, D. Braun, and S. Lloyd, Quantum Algorithm for Data Fitting, Phys. Rev. Lett. 109, 050505 (2012).
[12] J. Preskill, Quantum computing in the NISQ era and beyond, Quantum 2, 79 (2018).
[13] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, Quantum Boltzmann Machine, Phys. Rev. X 8, 021050 (2018).
[14] M. Schuld, A. Bocharov, K. M. Svore, and N. Wiebe, Circuit-centric quantum classifiers, Phys. Rev. A 101, 032308 (2020).
[15] I. Kerenidis, J. Landman, A. Luongo, and A. Prakash, in Advances in Neural Information Processing Systems (Curran Associates, Red Hook, 2019), p. 4134.
[16] B. T. Kiani, A. Villanyi, and S. Lloyd, Quantum medical imaging algorithms, arXiv:2004.02036 (2020).
[17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering the game of go without human knowledge, Nature 550, 354 (2017).
[18] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., A general reinforcement learning algorithm that masters chess, shogi, and go through self-play, Science 362, 1140 (2018).
[19] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. A. Sallab, S. Yogamani, and P. Pérez, Deep reinforcement learning for autonomous driving: A survey, arXiv:2002.00444 (2020).
[20] M. Popova, O. Isayev, and A. Tropsha, Deep reinforcement learning for de novo drug design, Sci. Adv. 4, eaap7885 (2018).
[21] A. A. Melnikov, H. P. Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, Active learning machine learns to create new quantum experiments, Proc. Natl. Acad. Sci. 115, 1221 (2018).
[22] A. A. Melnikov, P. Sekatski, and N. Sangouard, Setting up Experimental Bell Tests with Reinforcement Learning, Phys. Rev. Lett. 125, 160401 (2020).
[23] M. August and J. M. Hernández-Lobato, in International Conference on High Performance Computing (Springer, Cham, 2018), p. 591.
[24] T. Fösel, P. Tighineanu, T. Weiss, and F. Marquardt, Reinforcement Learning with Neural Networks for Quantum Feedback, Phys. Rev. X 8, 031084 (2018).
[25] Z. T. Wang, Y. Ashida, and M. Ueda, Deep Reinforcement Learning Control of Quantum Cartpoles, Phys. Rev. Lett. 125, 100401 (2020).
[26] L. J. Fiderer, J. Schuff, and D. Braun, Neural-network heuristics for adaptive bayesian quantum estimation, arXiv:2003.02183 (2020).
[27] J. Yao, M. Bukov, and L. Lin, Policy gradient based quantum approximate optimization algorithm, arXiv:2002.01068 (2020).
[28] P. Andreasson, J. Johansson, S. Liljestrand, and M. Granath, Quantum error correction for the toric code using deep reinforcement learning, Quantum 3, 183 (2019).
[29] H. P. Nautrup, N. Delfosse, V. Dunjko, H. J. Briegel, and N. Friis, Optimizing quantum error correction codes with reinforcement learning, Quantum 3, 215 (2019).
[30] R. Sweke, M. Kesselring, E. Van Nieuwenburg, and J. Eisert, Reinforcement learning decoders for fault-tolerant quantum computation, Machine Learning: Sci. Technol. 2, 025005 (2020).
[31] G. Dulac-Arnold, D. Mankowitz, and T. Hester, Challenges of real-world reinforcement learning, arXiv:1904.12901 (2019).
[32] Y. Lecun, S. Chopra, R. Hadsell, M. A. Ranzato, and F. J. Huang, in Predicting Structured Data (MIT Press, Cambridge, 2006).
[33] H. J. Kappen, Learning quantum models from quantum or classical data, J. Phys. A: Math. Theor. 53, 214001 (2020).
[34] N. Wiebe, A. Kapoor, and K. M. Svore, Quantum deep learning, Quantum Inf. Comput. 16, 541 (2016).
[35] N. Wiebe and L. Wossnig, Generative training of quantum Boltzmann machines with hidden units, arXiv:1905.09902 (2019).
[36] J.-G. Liu and L. Wang, Differentiable learning of quantum circuit born machines, Phys. Rev. A 98, 062324 (2018).
[37] D. Zhu, N. M. Linke, M. Benedetti, K. A. Landsman, N. H. Nguyen, C. H. Alderete, A. Perdomo-Ortiz, N. Korda, A. Garfoot, C. Brecque, et al., Training of quantum circuits on a hybrid quantum computer, Sci. Adv. 5, eaaw9918 (2019).
[38] W. Huggins, P. Patil, B. Mitchell, K. B. Whaley, and E. M. Stoudenmire, Towards quantum machine learning with tensor networks, Quantum Sci. Technol. 4, 024001 (2019).
[39] M. Kieferová and N. Wiebe, Tomography and generative training with quantum Boltzmann machines, Phys. Rev. A 96, 062327 (2017).
[40] Y. Du, M.-H. Hsieh, T. Liu, and D. Tao, The expressive power of parameterized quantum circuits, arXiv:1810.11922 (2018).
[41] X. Gao, Z.-Y. Zhang, and L.-M. Duan, A quantum machine learning algorithm based on generative models, Sci. Adv. 4, eaat9004 (2018).

[42] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518, 529 (2015).
[43] B. Sallans and G. E. Hinton, Reinforcement learning with factored states and actions, J. Mach. Learn. Res. 5, 1063 (2004).
[44] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (The MIT Press, Cambridge, 1998).
[45] H. J. Briegel and G. De las Cuevas, Projective simulation for artificial intelligence, Sci. Rep. 2, 400 (2012).
[46] Note that, in standard VBMs, states and actions are part of a mathematical object describing the agent's task environment: a Markov decision process. In PS, although similar to VBMs in its two-layer form, the perspective is different and more agent centered: representations of states and actions have a separate status as entities in the agent's memory, they can be created, erased, or modified as part of the agent's dynamics of memory processing and deliberation.
[47] M. Bukov, A. G. R. Day, D. Sels, P. Weinberg, A. Polkovnikov, and P. Mehta, Reinforcement Learning in Different Phases of Quantum Control, Phys. Rev. X 8, 031086 (2018).
[48] M. Bukov, Reinforcement learning for autonomous preparation of Floquet-engineered states: Inverting the quantum kapitza oscillator, Phys. Rev. B 98, 224305 (2018).
[49] P. Palittapongarnpim, P. Wittek, and B. C. Sanders, in 24th European Symposium on Artificial Neural Networks, Bruges, April 27–29, 2016 (Ciaco, Louvain-la-Neuve, 2016), p. 327.
[50] S. S. Vedaie, P. Palittapongarnpim, and B. C. Sanders, in 2018 IEEE Photonics Society Summer Topical Meeting Series (SUM) (IEEE, Piscataway, 2018), p. 163.
[51] R. Porotti, D. Tamascelli, M. Restelli, and E. Prati, Coherent transport of quantum states by deep reinforcement learning, Commun. Phys. 2, 61 (2019).
[52] J. Wallnöfer, A. A. Melnikov, W. Dür, and H. J. Briegel, Machine learning for long-distance quantum communication, PRX Quantum 1, 010301 (2020).
[53] H. Xu, J. Li, L. Liu, Y. Wang, H. Yuan, and X. Wang, Transferable control for quantum parameter estimation through reinforcement learning, arXiv:1904.11298 (2019).
[54] M. Dalgaard, F. Motzoi, J. J. Sorensen, and J. Sherson, Global optimization of quantum dynamics with alphazero deep exploration, npj Quantum Inf. 6, 1 (2020).
[55] P. Ronagh, Quantum algorithms for solving dynamic programming problems, arXiv:1906.02229 (2019).
[56] V. Dunjko, J. M. Taylor, and H. J. Briegel, in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (IEEE, Piscataway, 2017), p. 282.
[57] L. K. Grover, in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (Association for Computing Machinery, New York, 1996), p. 212.
[58] G. Brassard, M. Mosca, and A. Tapp, Quantum amplitude amplification and estimation, Contemp. Math. 305, 53 (2002).
[59] V. Dunjko, J. M. Taylor, and H. J. Briegel, Quantum-Enhanced Machine Learning, Phys. Rev. Lett. 117, 130501 (2016).
[60] V. Dunjko, J. M. Taylor, and H. J. Briegel, Framework for learning agents in quantum environments, arXiv:1507.08482 (2015).
[61] A. Cornelissen, Quantum Gradient Estimation and its Application to Quantum Reinforcement Learning (TU Delft Library, Delft, 2018).
[62] V. Dunjko, Y.-K. Liu, X. Wu, and J. M. Taylor, Exponential improvements for quantum-accessible reinforcement learning, arXiv:1710.11160 (2017).
[63] G. D. Paparo, V. Dunjko, A. Makmal, M. A. Martin-Delgado, and H. J. Briegel, Quantum Speedup for Active Learning Agents, Phys. Rev. X 4, 031002 (2014).
[64] D. Crawford, A. Levit, N. Ghadermarzy, J. S. Oberoi, and P. Ronagh, Reinforcement learning using quantum Boltzmann machines, Quantum Inf. Comput. 18, 51 (2018).
[65] A. Levit, D. Crawford, N. Ghadermarzy, J. S. Oberoi, E. Zahedinejad, and P. Ronagh, Free energy-based reinforcement learning using a quantum processor, arXiv:1706.00074 (2017).
[66] F. Neukart, D. Von Dollen, C. Seidel, and G. Compostella, Quantum-enhanced reinforcement learning for finite-episode games with discrete state spaces, Front. Phys. 5, 71 (2018).
[67] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, in Thirty-Second AAAI Conference on Artificial Intelligence (AAAI Press, Palo Alto, 2018).
[68] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR.org, Cambridge, 2017), p. 1352.
[69] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, Deep reinforcement learning in large discrete action spaces, arXiv:1512.07679 (2015).
[70] G. E. Hinton and T. J. Sejnowski, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Citeseer, Silver Spring, 1983), p. 448.
[71] E. Ising, Contribution to the theory of ferromagnetism, Z. Phys. 31, 253 (1925).
[72] N. Heess, D. Silver, and Y. W. Teh, in Proceedings of the Tenth European Workshop on Reinforcement Learning Vol. 24 (PMLR, Cambridge, 2013), p. 45.
[73] M. Otsuka, J. Yoshimoto, and K. Doya, in ESANN (D-side, Evere, 2010).
[74] S. Elfwing, E. Uchibe, and K. Doya, From free energy to expected energy: Improving energy-based value function approximation in reinforcement learning, Neural Netw. 84, 17 (2016).
[75] J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel, in Advances in Neural Information Processing Systems (Curran Associates, Red Hook, 2013), p. 2877.
[76] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, in Proceedings of the 28th International Conference on Machine Learning (ICML-11) (International Machine Learning Society, Madison, 2011), p. 1105.
[77] Y. Du and I. Mordatch, in Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019), Vol. 32, p. 3608.

[78] P. M. Long and R. Servedio, in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (Springer, Berlin, 2010), p. 703.
[79] G. E. Hinton, in Neural Networks: Tricks of the Trade (Springer, Berlin, 2012), p. 599.
[80] Under maximum-likelihood learning. In practice, a different gradient is used to train RBMs, derived by the contrastive divergence method [145,146]. Even though this "gradient" is computed more efficiently, it does not correspond to the gradient of any true objective function [147], it can lead to very large approximation errors in the learned distributions [148–150] and cannot be used to train full (all-to-all connected) Boltzmann machines [34].
[81] The approximation comes in SARSA and Q-learning from considering Q(s^(t+1), a) constant with respect to θ, which is an assumption made in temporal difference learning.
[82] N. Le Roux and Y. Bengio, Representational power of restricted Boltzmann machines and deep belief networks, Neural Comput. 20, 1631 (2008).
[83] G. F. Montúfar, J. Rauh, and N. Ay, in Advances in Neural Information Processing Systems (Curran Associates, Red Hook, 2011), p. 415.
[84] V. Vapnik and A. Y. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Measures Complexity 16, 11 (1971).
[85] Originally used to quantify the expressive power of classifiers, its definition can also be extended to regression models.
[86] N. Harvey, C. Liaw, and A. Mehrabian, in Conference on Learning Theory (PMLR, Amsterdam, 2017), p. 1064.
[87] S. Liang and R. Srikant, Why deep neural networks for function approximation? arXiv:1610.04161 (2016).
[88] R. Salakhutdinov and G. Hinton, in Artificial Intelligence and Statistics (PMLR, Clearwater Beach, 2009), p. 448.
[89] G. E. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18, 1527 (2006).
[90] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Cognitive Model. 5, 1 (1986).
[91] J. N. Tsitsiklis and B. Van Roy, in Advances in Neural Information Processing Systems (Kaufmann, San Mateo, 1997), p. 1075.
[92] J. Fu, A. Kumar, M. Soh, and S. Levine, Diagnosing Bottlenecks in Deep Q-learning Algorithms (PMLR, Long Beach, California, USA, 2019), p. 2021.
[93] Note that, combined with the policy-sampling effect, we also have that the target function is moving, as the policy of the agent changes over training.
[94] L. M. Trenkwalder, S. Jerbi, and H. Poulsen Nautrup, Quantum-Enhanceable Deep Reinforcement Learning (GitHub, 2020). https://github.com/LeaMarion/quantum-enhanceable-deep-rl.
[95] A. A. Melnikov, A. Makmal, and H. J. Briegel, Benchmarking projective simulation in navigation problems, IEEE Access 6, 64639 (2018).
[96] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, Openai gym, arXiv:1606.01540 (2016).
[97] A. W. Harrow and A. Y. Wei, in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SIAM, Philadelphia, 2020), p. 193.
[98] M.-H. Yung and A. Aspuru-Guzik, A quantum–quantum metropolis algorithm, Proc. Natl. Acad. Sci. 109, 754 (2012).
[99] J. Van Apeldoorn, A. Gilyén, S. Gribling, and R. de Wolf, Quantum sdp-solvers: Better upper and lower bounds, Quantum 4, 230 (2020).
[100] A. N. Chowdhury, G. H. Low, and N. Wiebe, A variational quantum algorithm for preparing quantum gibbs states, arXiv:2002.00055 (2020).
[101] J. Wu and T. H. Hsieh, Variational Thermal Quantum Simulation via Thermofield Double States, Phys. Rev. Lett. 123, 220502 (2019).
[102] Y. Wang, G. Li, and X. Wang, Variational quantum gibbs state preparation with a truncated taylor series, arXiv:2005.08797 (2020).
[103] D. J. Aldous, Some inequalities for reversible markov chains, J. London Math. Soc. 2, 564 (1982).
[104] N. Srivastava and R. R. Salakhutdinov, in Advances in Neural Information Processing Systems (Curran Associates, Red Hook, 2012), p. 2222.
[105] X. Gao and L.-M. Duan, Efficient representation of quantum many-body states with deep neural networks, Nat. Commun. 8, 1 (2017).
[106] D. W. Berry and A. M. Childs, Black-box hamiltonian simulation and unitary implementation, arXiv:0910.4157 (2009).
[107] For QBMs, we are as well interested in Gibbs distributions for configurations {x} that are not part of the eigenbasis of H, leading to a different expression of Eq. (15) that involves the eigendecomposition of H. For classical Hamiltonians (e.g., DBMs), these eigenvectors are simply computational basis states.
[108] Actually, the general goal of purified Gibbs state preparation is to prepare a state |φ⟩_AB such that Tr_B |φ⟩⟨φ|_AB ≈ |ψ⟩⟨ψ| = e^{−βH}/Tr[e^{−βH}]. The state of Eq. (16) is a particular instance, called a coherent encoding, which can be more useful than a general purified Gibbs state, e.g., when estimating the partition function of the associated Gibbs distribution, Eq. (15) [122].
[109] C. Durr and P. Hoyer, A quantum algorithm for finding the minimum, quant-ph/9607014 (1996).
[110] G. H. Low and I. L. Chuang, Hamiltonian simulation by qubitization, Quantum 3, 163 (2019).
[111] T. J. Yoder, G. H. Low, and I. L. Chuang, Fixed-Point Quantum Search with an Optimal Number of Queries, Phys. Rev. Lett. 113, 210501 (2014).
[112] M. Szegedy, in FOCS'04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (IEEE, Los Alamitos, 2004), p. 32.
[113] F. Magniez, A. Nayak, J. Roland, and M. Santha, Search via quantum walk, SIAM J. Comput. 40, 142 (2011).
[114] P. Wocjan and A. Abeyesinghe, Speedup via quantum sampling, Phys. Rev. A 78, 042336 (2008).
[115] D. Orsucci, H. J. Briegel, and V. Dunjko, Faster quantum mixing for slowly evolving sequences of markov chains, Quantum 2, 105 (2018).

[116] This is also the reason why we deem classical Gibbs sampling to be inapplicable to QBMs in Table I, and why exact sampling has larger complexity for QBMs than DBMs: one should additionally diagonalize the QBM Hamiltonian to be able to exponentiate it.
[117] K. Temme, T. J. Osborne, K. G. Vollbrecht, D. Poulin, and F. Verstraete, Quantum metropolis sampling, Nature 471, 87 (2011).
[118] R. D. Somma, S. Boixo, H. Barnum, and E. Knill, Quantum Simulations of Classical Annealing Processes, Phys. Rev. Lett. 101, 130504 (2008).
[119] Y. Subaşı, L. Cincio, and P. J. Coles, Entanglement spectroscopy with a depth-two quantum circuit, J. Phys. A: Math. Theor. 52, 044001 (2019).
[120] J. Nocedal and S. Wright, Numerical Optimization (Springer Science & Business Media, New York, 2006).
[121] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani, Random generation of combinatorial structures from a uniform distribution, Theor. Comput. Sci. 43, 169 (1986).
[122] P. Wocjan, C.-F. Chiang, A. Abeyesinghe, and D. Nagaj, Quantum speed-up for approximating partition functions, arXiv:0811.0596 (2008).
[123] E. Campbell, A. Khurana, and A. Montanaro, Applying quantum algorithms to constraint satisfaction problems, Quantum 3, 167 (2019).
[124] R. Babbush, J. McClean, C. Gidney, S. Boixo, and H. Neven, Focus beyond quadratic speedups for error-corrected quantum advantage, arXiv:2011.04149 (2020).
[125] S. Endo, S. C. Benjamin, and Y. Li, Practical Quantum Error Mitigation for Near-Future Applications, Phys. Rev. X 8, 031027 (2018).
[126] B. Koczor, Exponential error suppression for near-term quantum devices, arXiv:2011.05942 (2020).
[127] R. LaRose and B. Coyle, Robust data encodings for quantum classifiers, arXiv:2003.01695 (2020).
[128] E. Farhi and H. Neven, Classification with quantum neural networks on near term processors, arXiv:1802.06002 (2018).
[129] S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, Variational quantum circuits for deep reinforcement learning, IEEE Access 8, 141007 (2020).
[130] M. Schuld, R. Sweke, and J. J. Meyer, The effect of data encoding on the expressive power of variational quantum machine learning models, arXiv:2008.08605 (2020).
[131] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, Supervised learning with quantum-enhanced feature spaces, Nature 567, 209 (2019).
[132] M. Schuld and N. Killoran, Quantum Machine Learning in Feature Hilbert Spaces, Phys. Rev. Lett. 122, 040504 (2019).
[133] S. Lloyd, M. Schuld, A. Ijaz, J. Izaac, and N. Killoran, Quantum embeddings for machine learning, arXiv:2001.03622 (2020).
[134] I. A. Murray, Ph.D. thesis, University of London, 2007.
[135] Y. Iba, N. Saito, and A. Kitajima, Multicanonical MCMC for sampling rare events: An illustrative review, Ann. Inst. Stat. Math. 66, 611 (2014).
[136] Intuitively, this has to do with the fact that all eigenvalue moduli of P are strictly less than 1 (except for the eigenvalue 1 associated to the stationary distribution), and repeated applications of P make its (noneigenvalue 1) eigenspaces fade away with a geometric rate corresponding to their eigenvalue modulus. The slowest fading rate is given by the second largest eigenvalue modulus.
[137] W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57, 97 (1970).
[138] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intelligence 6, 721 (1984).
[139] Note that the asterisk on P* does not indicate complex conjugation.
[140] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cogn. Sci. 9, 147 (1985).
[141] A. Fischer and C. Igel, in Iberoamerican Congress on Pattern Recognition (Springer, Berlin, 2012), p. 14.
[142] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, Cambridge, 2010).
[143] V. Dunjko, N. Friis, and H. J. Briegel, Quantum-enhanced deliberation of learning agents using trapped ions, New J. Phys. 17, 023006 (2015).
[144] L. Grover and T. Rudolph, Creating superpositions that correspond to efficiently integrable probability distributions, quant-ph/0208112 (2002).
[145] M. Welling and G. E. Hinton, in International Conference on Artificial Neural Networks (Springer, New York, 2002), p. 351.
[146] M. A. Carreira-Perpinan and G. E. Hinton, in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, edited by R. G. Cowell and Z. Ghahramani (Society for Artificial Intelligence and Statistics, Fort Lauderdale, 2005), Vol. 10, p. 33.
[147] I. Sutskever and T. Tieleman, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (PMLR, Sardinia, 2010), p. 789.
[148] A. Fischer and C. Igel, Bounding the bias of contrastive divergence learning, Neural Comput. 23, 664 (2011).
[149] T. Tieleman and G. Hinton, in Proceedings of the 26th Annual International Conference on Machine Learning (MIT Press, Cambridge, 2009), p. 1033.
[150] Y. Bengio and O. Delalleau, Justifying and generalizing contrastive divergence, Neural Comput. 21, 1601 (2009).
