Parametrized Quantum Policies for Reinforcement Learning
Abstract
With the advent of real-world quantum computing, the idea that parametrized
quantum computations can be used as hypothesis families in a quantum-classical
machine learning system is gaining increasing traction. Such hybrid systems have
already shown the potential to tackle real-world tasks in supervised and generative
learning, and recent works have established their provable advantages in special
artificial tasks. Yet, in the case of reinforcement learning, which is arguably the most
challenging setting and in which learning boosts would be extremely valuable, no proposal
has been successful in solving even standard benchmarking tasks, nor in showing a
theoretical learning advantage over classical algorithms. In this work, we achieve
both. We propose a hybrid quantum-classical reinforcement learning model using
very few qubits, which we show can be effectively trained to solve several standard
benchmarking environments. Moreover, we demonstrate, and formally prove, the
ability of parametrized quantum circuits to solve certain learning tasks that are
intractable for classical models, including current state-of-the-art deep neural networks,
under the widely believed classical hardness of the discrete logarithm problem.
1 Introduction
Hybrid quantum machine learning models constitute one of the most promising applications of
near-term quantum computers [1, 2]. In these models, parametrized and data-dependent quantum
computations define a hypothesis family for a given learning task, and a classical optimization
algorithm is used to train them. For instance, parametrized quantum circuits (PQCs) [3] have already
proven successful in classification [4–8], generative modeling [9, 10] and clustering [11] problems.
Moreover, recent results have shown proofs of their learning advantages in artificially constructed
tasks [6, 12], some of which are based on widely believed complexity-theoretic assumptions [12–15].
All these results, however, only consider supervised and generative learning settings.
Arguably, the largest impact quantum computing can have is by providing enhancements to the
hardest learning problems. From this perspective, reinforcement learning (RL) stands out as a field
that can greatly benefit from a powerful hypothesis family. This is showcased by the boost in learning
performance that deep neural networks (DNNs) have provided to RL [16], which enabled systems
like AlphaGo [17], among other achievements [18, 19]. Nonetheless, the true potential of near-term
quantum approaches to RL remains largely unexplored. The few existing works [20–23] have so far
failed to solve classical benchmarking tasks using PQCs, and have left open the question of their
ability to provide a learning advantage.
[Figure 1: schematic of a PQC-based RL policy. A PQC parametrized by θ and acting on qubits initialized in |0⟩ encodes the state s; the measured expectation values ⟨O_a⟩_{s,θ} define the policy π_θ(a|s), from which an action a is sampled, and yield the log-policy gradient ∇_θ log π_θ.]
Contributions In this work, we demonstrate the potential of policies based on PQCs in solving
classical RL environments. To do this, we first propose new model constructions, describe their learn-
ing algorithms, and show numerically the influence of design choices on their learning performance.
In our numerical investigation, we consider benchmarking environments from OpenAI Gym [24], for
which good and simple DNN policies are known, and in which we demonstrate that PQC policies can
achieve comparable performance. Second, inspired by the classification task of Havlíček et al. [6],
conjectured to be classically hard by the authors, we construct analogous RL environments where
we show an empirical learning advantage of our PQC policies over standard DNN policies used in
deep RL. In the same direction, we construct RL environments with a provable gap in performance
between a family of PQC policies and any efficient classical learner. These environments essentially
build upon the work of Liu et al. [14] by embedding into a learning setting the discrete logarithm
problem (DLP), which is the problem solved by Shor’s celebrated quantum algorithm [25] but widely
believed to be classically hard to solve [26].
Related work Recently, a few works have been exploring hybrid quantum approaches for RL.
Among these, Refs. [20, 21] also trained PQC-based agents in classical RL environments. However,
these take a value-based approach to RL, meaning that they use PQCs as value-function approximators
instead of direct policies. The learning agents in these works are also tested on OpenAI Gym
environments (namely, a modified FrozenLake and CartPole), but do not achieve good enough
performance to solve them according to the Gym specifications. Ref. [27] shows that, using
some of our design choices for PQCs in RL (i.e., data re-uploading circuits [28] with trainable
observable weights and input scaling parameters), one can also solve these environments using a
value-based approach. An actor-critic approach to QRL was introduced in Ref. [22], using both a PQC
actor (or policy) and a PQC critic (or value-function approximator). In contrast to our work, these
are trained in quantum environments (e.g., quantum-control environments), that provide a quantum
state to the agent, which acts back with a continuous classical action. These aspects make it a very
different learning setting to ours. Ref. [23] also describes a hybrid quantum-classical algorithm for
value-based RL. The function-approximation models on which this algorithm is applied are however
not PQCs but energy-based neural networks (e.g., deep and quantum Boltzmann machines). Finally,
our work provides an alternative approach to take advantage of quantum effects in designing QRL
agents compared to earlier approaches [29–33], which are mainly based on (variations of) Grover’s
search algorithm [34] or quantum annealers [35] to speed up sampling routines.
Code An accompanying tutorial [36], implemented as part of the quantum machine learning library
TensorFlow Quantum [37], provides the code required to reproduce our numerical results and explore
different settings. It also implements the Q-learning approach for PQC-based RL of Skolik et al. [27].
[Figure 2 circuit: qubits |0⟩_0 and |0⟩_1 each undergo H, the variational rotations R_z(φ_{0,·}), R_y(φ_{0,·}) of U_var(φ_0), a Ctrl-Z entangler, the encoding rotations R_y(λ_{0,·} s_·), R_z(λ_{0,·} s_·) of U_enc(s, λ_0), and a final variational unitary U_var(φ_1).]
Figure 2: PQC architecture for n = 2 qubits and depth D_enc = 1. This architecture is composed
of alternating layers of encoding unitaries U_enc(s, λ_i), taking as input a state vector s = (s_0, . . . , s_{d−1})
and scaling parameters λ_i (part of a vector λ ∈ ℝ^{|λ|} of dimension |λ|), and variational unitaries
U_var(φ_i), taking as input rotation angles φ_i (part of a vector φ ∈ [0, 2π]^{|φ|} of dimension |φ|).
At the core of our parametrized quantum policies is a PQC defined by a unitary U(s, θ) that acts on
a fixed n-qubit state (e.g., |0^{⊗n}⟩). This unitary encodes an input state s ∈ ℝ^d and is parametrized
by a trainable vector θ. Although different choices of PQCs are possible, throughout our numerical
experiments (Secs. 3 and 4.2) we consider so-called hardware-efficient PQCs [40] with an alternating-
layered architecture [28, 41]. This architecture is depicted in Fig. 2 and essentially consists of an
alternation of D_enc encoding unitaries U_enc (composed of single-qubit rotations R_z, R_y) and D_enc + 1
variational unitaries U_var (composed of single-qubit rotations R_z, R_y and entangling Ctrl-Z gates).
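To make the architecture concrete, the following is a minimal Cirq sketch of such an alternating-layered circuit. It is illustrative only: the function name, parameter shapes and the simple chain of Ctrl-Z entanglers are our own choices (the entangling topologies actually used per environment are listed in Table 3), not the authors' code.

```python
import cirq
import numpy as np

def alternating_layer_pqc(qubits, phis, lambdas, state):
    """Sketch of the alternating-layered architecture of Fig. 2: D_enc + 1 variational
    layers (Rz, Ry rotations followed by a chain of Ctrl-Z entanglers) interleaved with
    D_enc encoding layers Ry(lambda * s_j), Rz(lambda * s_j).
    Assumed shapes: phis is (D_enc + 1, n, 2), lambdas is (D_enc, n, 2), state has n entries."""
    n = len(qubits)
    circuit = cirq.Circuit(cirq.H.on_each(*qubits))
    d_enc = len(lambdas)
    for layer in range(d_enc + 1):
        # variational unitary U_var(phi_layer)
        for j, q in enumerate(qubits):
            circuit.append([cirq.rz(phis[layer][j][0]).on(q),
                            cirq.ry(phis[layer][j][1]).on(q)])
        circuit.append(cirq.CZ(qubits[j], qubits[j + 1]) for j in range(n - 1))
        # encoding unitary U_enc(s, lambda_layer), absent after the last variational layer
        if layer < d_enc:
            for j, q in enumerate(qubits):
                circuit.append([cirq.ry(lambdas[layer][j][0] * state[j]).on(q),
                                cirq.rz(lambdas[layer][j][1] * state[j]).on(q)])
    return circuit

# e.g., the n = 2, D_enc = 1 layout of Fig. 2, with all-zero angles and unit scalings:
qubits = cirq.LineQubit.range(2)
circuit = alternating_layer_pqc(qubits, np.zeros((2, 2, 2)), np.ones((1, 2, 2)), state=[0.1, -0.4])
```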
For any given PQC, we define two families of policies, differing in how the final quantum states
|ψ_{s,θ}⟩ = U(s, θ) |0^{⊗n}⟩ are used. In the RAW-PQC model, we exploit the probabilistic nature of
quantum measurements to define an RL policy. For |A| actions available to the RL agent, we partition
the Hilbert space H into |A| disjoint subspaces (e.g., spanned by computational basis states) and
associate a projection P_a to each of these subspaces. The projective measurement associated to the
observable O = Σ_a a P_a then defines our RAW-PQC policy π_θ(a|s) = ⟨P_a⟩_{s,θ}. A limitation of this policy family, however,
is that it does not have a directly adjustable greediness (i.e., a control parameter that makes the
policy more peaked). This consideration arises naturally in an RL context where an agent’s policy
needs to shift from an exploratory behavior (i.e., close to a uniform distribution) to a more exploitative
behavior (i.e., a peaked distribution). To remedy this limitation, we define the SOFTMAX-PQC model,
which applies an adjustable softmax_β non-linear activation function to the expectation values ⟨P_a⟩_{s,θ}
measured on |ψ_{s,θ}⟩. Since the softmax function normalizes any real-valued input, we can generalize
the projections P_a to arbitrary Hermitian operators O_a. We also generalize these observables one
step further by assigning them trainable weights. The two models are formally defined below.
Definition 1 (RAW- and SOFTMAX-PQC). Given a PQC acting on n qubits, taking as input a state
s ∈ ℝ^d, rotation angles φ ∈ [0, 2π]^{|φ|} and scaling parameters λ ∈ ℝ^{|λ|}, such that its corresponding
unitary U(s, φ, λ) produces the quantum state |ψ_{s,φ,λ}⟩ = U(s, φ, λ) |0^{⊗n}⟩, we define its associated
RAW-PQC policy as:
    π_θ(a|s) = ⟨P_a⟩_{s,θ},                                                       (2)
where ⟨P_a⟩_{s,θ} = ⟨ψ_{s,φ,λ}| P_a |ψ_{s,φ,λ}⟩ is the expectation value of a projection P_a associated to action
a, such that Σ_a P_a = I and P_a P_{a'} = δ_{a,a'} P_a. Here θ = (φ, λ) constitutes all of its trainable parameters.
Using the same PQC, we also define a SOFTMAX-PQC policy as:
    π_θ(a|s) = e^{β⟨O_a⟩_{s,θ}} / Σ_{a'} e^{β⟨O_{a'}⟩_{s,θ}},                     (3)
where ⟨O_a⟩_{s,θ} = ⟨ψ_{s,φ,λ}| Σ_i w_{a,i} H_{a,i} |ψ_{s,φ,λ}⟩ is the expectation value of the weighted Hermitian
operators H_{a,i} associated to action a, β ∈ ℝ is an inverse-temperature parameter, and θ = (φ, λ, w).
Note that we adopt here a very general definition for the observables O_a of our SOFTMAX-PQC
policies. As we discuss in more detail in Appendix C, very expressive trainable observables can in
some extreme cases take over all training of the PQC parameters φ, λ and render the role of the PQC
in learning trivial. However, in practice, as well as in our numerical experiments, we only consider
very restricted observables O_a = Σ_i w_{a,i} H_{a,i}, where the H_{a,i} are (tensor products of) Pauli matrices or
high-rank projections on computational basis states, which do not allow for these extreme scenarios.
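As a concrete illustration of Def. 1, the following Python sketch turns estimated expectation values into the two policies of Eqs. (2) and (3); the function names and the renormalization of noisy projector estimates are our own conventions, not the paper's implementation.

```python
import numpy as np

def softmax_pqc_policy(expvals, beta):
    """SOFTMAX-PQC policy of Eq. (3): expvals[a] holds the estimated <O_a>_{s,theta}
    and beta is the inverse temperature; returns pi_theta(.|s)."""
    logits = beta * np.asarray(expvals, dtype=float)
    logits -= logits.max()            # shift for numerical stability only
    probs = np.exp(logits)
    return probs / probs.sum()

def raw_pqc_policy(projector_expvals):
    """RAW-PQC policy of Eq. (2): projector_expvals[a] holds the estimated <P_a>_{s,theta}
    for a complete set of orthogonal projections, so the values already form a probability
    distribution; we only clip and renormalize to absorb estimation noise."""
    p = np.clip(np.asarray(projector_expvals, dtype=float), 0.0, None)
    return p / p.sum()

# e.g., two actions whose observables have estimated expectation values 0.2 and -0.5:
print(softmax_pqc_policy([0.2, -0.5], beta=1.0))
```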
In our PQC construction, we include trainable scaling parameters λ, used in every encoding gate to
re-scale its input components. This modification to the standard data encoding in PQCs comes in
light of recent considerations on the structure of PQC functions [42]. These additional parameters
make it possible to represent functions with a wider and richer spectrum of frequencies, and hence provide
shallow PQCs with more expressive power.
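This effect can already be seen on a single qubit. The following numpy sketch (our own toy example, not taken from the paper) evaluates ⟨Z⟩ for a re-uploading circuit with two encoding rotations and shows that larger scaling parameters λ give the function access to higher frequencies in the input x.

```python
import numpy as np

def ry(a):  # single-qubit Y-rotation
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]], dtype=complex)

def rz(a):  # single-qubit Z-rotation
    return np.diag([np.exp(-1j * a / 2), np.exp(1j * a / 2)])

def expval_z(x, lambdas, phis):
    """<Z> of a single-qubit re-uploading circuit alternating encodings Ry(lambda_i * x)
    with variational rotations Rz(phi_i), starting from |0>."""
    psi = np.array([1.0, 0.0], dtype=complex)
    for lam, phi in zip(lambdas, phis):
        psi = rz(phi) @ ry(lam * x) @ psi
    return float(np.real(psi.conj() @ np.diag([1.0, -1.0]) @ psi))

xs = np.linspace(0, 2 * np.pi, 200)
f_plain  = [expval_z(x, lambdas=[1.0, 1.0], phis=[0.3, 1.2]) for x in xs]  # frequencies up to 2
f_scaled = [expval_z(x, lambdas=[3.0, 2.0], phis=[0.3, 1.2]) for x in xs]  # frequencies up to 5
```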
In order to analyze the properties of our PQC policies without the interference of other learning
mechanisms [43], we train these policies using the basic Monte Carlo policy gradient algorithm
REINFORCE [44, 45] (see Alg. 1). This algorithm consists of evaluating Monte Carlo estimates of
the value function V_{π_θ}(s_0) = E_{π_θ}[Σ_{t=0}^{H−1} γ^t r_t], with γ ∈ [0, 1], using batches of interactions with the
environment, and updating the policy parameters θ via gradient ascent on V_{π_θ}(s_0). The resulting
updates (see line 8 of Alg. 1) involve the gradient of the log-policy ∇_θ log π_θ(a|s), which we
therefore need to compute for our policies. We describe this computation in the following lemma.
Lemma 1. Given a SOFTMAX-PQC policy π_θ, the gradient of its logarithm is given by:
    ∇_θ log π_θ(a|s) = β ( ∇_θ ⟨O_a⟩_{s,θ} − Σ_{a'} π_θ(a'|s) ∇_θ ⟨O_{a'}⟩_{s,θ} ).     (4)
Partial derivatives with respect to observable weights are trivially given by ∂_{w_{a,i}} ⟨O_a⟩_{s,θ} =
⟨ψ_{s,φ,λ}| H_{a,i} |ψ_{s,φ,λ}⟩ (see Def. 1), while derivatives with respect to rotation angles ∂_{φ_i} ⟨O_a⟩_{s,θ}
and scaling parameters¹ ∂_{λ_i} ⟨O_a⟩_{s,θ} can be estimated via the parameter-shift rule [46, 42]:
    ∂_i ⟨O_a⟩_{s,θ} = (1/2) ( ⟨O_a⟩_{s, θ + (π/2) e_i} − ⟨O_a⟩_{s, θ − (π/2) e_i} ),        (5)
i.e., using the difference of two expectation values ⟨O_a⟩_{s,θ'} with a single angle shifted by ±π/2.
For a RAW-PQC policy π_θ, we have instead:
    ∇_θ log π_θ(a|s) = ∇_θ ⟨P_a⟩_{s,θ} / ⟨P_a⟩_{s,θ},                                  (6)
where the partial derivatives ∂_{φ_i} ⟨P_a⟩_{s,θ} and ∂_{λ_i} ⟨P_a⟩_{s,θ} can be estimated similarly to above.
¹ Note that the parameters λ do not act as rotation angles. To compute the derivatives ∂_{λ_{i,j}} ⟨O_a⟩_{s,θ}, one
should instead compute derivatives w.r.t. the effective angles s_j λ_{i,j} and apply the chain rule: ∂_{λ_{i,j}} ⟨O_a⟩_{s,θ} = s_j ∂_{s_j λ_{i,j}} ⟨O_a⟩_{s,θ}.
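For concreteness, here is a minimal sketch of these gradient estimates; `expval_fn` and `expval_of_angle` are placeholders standing in for circuit executions and measurement averaging, not an actual hardware interface.

```python
import numpy as np

def parameter_shift_grad(expval_fn, theta):
    """Gradient of <O_a>_{s,theta} w.r.t. rotation angles via Eq. (5):
    (1/2)(<O_a>_{theta + (pi/2) e_i} - <O_a>_{theta - (pi/2) e_i})."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2
        grad[i] = 0.5 * (expval_fn(theta + shift) - expval_fn(theta - shift))
    return grad

def scaling_param_derivative(expval_of_angle, lam, s_j):
    """Derivative w.r.t. a scaling parameter lambda_{i,j} (footnote 1): shift the effective
    encoding angle s_j * lambda_{i,j} by +-pi/2 and apply the chain rule (multiply by s_j)."""
    angle = s_j * lam
    d_angle = 0.5 * (expval_of_angle(angle + np.pi / 2) - expval_of_angle(angle - np.pi / 2))
    return s_j * d_angle
```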
Algorithm 1: REINFORCE with PQC policies and value-function baselines
Input: a PQC policy π_θ from Def. 1; a value-function approximator Ṽ_ω
1  Initialize parameters θ and ω;
2  while True do
3      Generate N episodes {(s_0, a_0, r_1, . . . , s_{H−1}, a_{H−1}, r_H)}_i following π_θ;
4      for episode i in batch do
5          Compute the returns G_{i,t} ← Σ_{t'=1}^{H−t} γ^{t'} r^{(i)}_{t+t'};
6          Compute the gradients ∇_θ log π_θ(a_t^{(i)}|s_t^{(i)}) using Lemma 1;
7      Fit {Ṽ_ω(s_t^{(i)})}_{i,t} to the returns {G_{i,t}}_{i,t};
8      Compute Δθ = (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t^{(i)}|s_t^{(i)}) (G_{i,t} − Ṽ_ω(s_t^{(i)}));
9      Update θ ← θ + αΔθ;
In some of our environments, we additionally rely on a linear value-function baseline to reduce the
variance of the Monte Carlo estimates [47]. We choose it to be identical to that of Ref. [48].
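The following Python sketch outlines one update of Alg. 1. The `policy` and `baseline` interfaces (a parameter vector `theta`, `log_prob_grad`, and fit/predict) are illustrative placeholders rather than the authors' implementation, which is available in the tutorial [36].

```python
import numpy as np

def reinforce_update(policy, baseline, episodes, gamma, alpha):
    """One iteration of Alg. 1. `episodes` is a list of trajectories
    [(s_0, a_0, r_1), ..., (s_{H-1}, a_{H-1}, r_H)]; `policy` exposes `theta` and
    `log_prob_grad(s, a)` (Lemma 1); `baseline` is any regressor with fit/predict,
    e.g. the linear value-function baseline of Ref. [48]."""
    all_states, all_returns, per_episode = [], [], []
    for ep in episodes:
        # returns G_t = sum_{t'=1}^{H-t} gamma^{t'} r_{t+t'}, computed backwards (line 5)
        G, returns = 0.0, []
        for (_, _, r) in reversed(ep):
            G = gamma * (r + G)
            returns.append(G)
        returns.reverse()
        per_episode.append(returns)
        all_states += [s for (s, _, _) in ep]
        all_returns += returns

    baseline.fit(all_states, all_returns)                # line 7
    delta = np.zeros_like(policy.theta)
    for ep, returns in zip(episodes, per_episode):       # line 8
        for (s, a, _), G in zip(ep, returns):
            delta += policy.log_prob_grad(s, a) * (G - baseline.predict([s])[0])
    policy.theta = policy.theta + alpha * delta / len(episodes)   # line 9
```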
A natural consideration when it comes to the implementation of our PQC policies is whether one can
efficiently (in the number of executions of the PQC on a quantum computer) sample and train them.
By design, sampling from our RAW-PQC policies can be done with a single execution (and measurement)
of the PQC: the projective measurement corresponding to the observable O = Σ_a a P_a
naturally samples a basis state associated to action a with probability ⟨P_a⟩_{s,θ}. However, as Eq. (6)
indicates, in order to train these policies using REINFORCE, one is nonetheless required to estimate
the expectation values ⟨P_a⟩_{s,θ}, along with the gradients ∇_θ ⟨P_a⟩_{s,θ}. Fortunately, these quantities
can be estimated efficiently up to some additive error ε, using only O(ε^{−2}) repeated executions and
measurements on a quantum computer.
In the case of our SOFTMAX-PQC policies, it is less clear whether similar noisy estimates ⟨Õ_a⟩_{s,θ}
of the expectation values ⟨O_a⟩_{s,θ} are sufficient to evaluate policies of the form of Eq. (3). We show
however that, using these noisy estimates, we can compute a policy π̃_θ that produces samples close to
those of the true policy π_θ. We state our result formally in the following lemma, proven in Appendix B.
Lemma 2. For a SOFTMAX-PQC policy π_θ defined by a unitary U(s, θ) and observables O_a,
call ⟨Õ_a⟩_{s,θ} approximations of the true expectation values ⟨O_a⟩_{s,θ} with at most ε additive error.
Then the approximate policy π̃_θ = softmax_β(⟨Õ_a⟩_{s,θ}) has total variation distance O(βε) to π_θ =
softmax_β(⟨O_a⟩_{s,θ}). Since expectation values can be efficiently estimated to additive error on a
quantum computer, this implies efficient approximate sampling from π_θ.
We also obtain a similar result for the log-policy gradient of SOFTMAX-PQCs (see Lemma 1), which
we show can be efficiently estimated to additive error in ℓ∞-norm (see Appendix B for a proof).
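A quick numerical sanity check of Lemma 2 (our own illustration): perturbing each expectation value by at most ε moves the softmax policy by at most 2 sinh(2βε) = 4βε + O((βε)³) in total variation distance, where TV is taken, as in Appendix B, as the sum of absolute differences.

```python
import numpy as np

def softmax(values, beta):
    z = beta * np.asarray(values, dtype=float)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
beta, eps = 1.0, 1e-3
expvals = rng.uniform(-1, 1, size=4)                 # true <O_a>_{s,theta}
noisy = expvals + rng.uniform(-eps, eps, size=4)     # estimates with additive error <= eps
tv = np.abs(softmax(noisy, beta) - softmax(expvals, beta)).sum()
assert tv <= 2 * np.sinh(2 * beta * eps)             # = 4*beta*eps + O((beta*eps)^3), cf. Eq. (7)
```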
[Figure 3 plots: average collected rewards vs. episode for CartPole-v1 (n = 4, D_enc = 1), MountainCar-v0 (n = 2, D_enc = 4) and Acrobot-v1 (n = 6, D_enc = 2), comparing softmax-PQC and raw-PQC agents.]
Figure 3: Numerical evidence of the advantage of SOFTMAX -PQC over RAW-PQC in bench-
marking environments. The learning curves (20 agents per curve) of randomly-initialized
SOFTMAX -PQC agents (green curves) and RAW-PQC agents (red curves) in OpenAI Gym en-
vironments: CartPole-v1, MountainCar-v0, and Acrobot-v1. Each curve is temporally averaged with
a time window of 10 episodes. All agents have been trained using the REINFORCE algorithm (see
Alg. 1), with value-function baselines for the MountainCar and Acrobot environments.
[Figure 4 plots: average collected rewards vs. episode for SOFTMAX-PQC agents on CartPole, MountainCar and Acrobot; curves compare the reference depths (1, 4, 2), increased depths (5, 6, 5), fixed scaling parameters λ, and fixed observable weights with β = 2, 10.]
Figure 4: Influence of the model architecture for SOFTMAX -PQC agents. The blue curves in
each plot correspond to the learning curves from Fig. 3 and are taken as a reference. Other curves
highlight the influence of individual hyperparameters. For RAW-PQC agents, see Appendix E.
The results of the previous subsection however do not indicate whether other design choices we have
made in Sec. 2.2 had an influence on the performance of our quantum agents. To address this, we
run a second set of experiments, presented in Fig. 4. In these simulations, we evaluate the average
performance of our SOFTMAX -PQC agents after modifying one of three design choices: we either
increment the depth of the PQC (until no significant increase in performance is observed), fix the
input-scaling parameters λ to 1, or fix the observable weights w to 1. By comparing the performance
of these agents with that of the agents from Fig. 3, we can make the following observations:
• Influence of depth: Increasing the depth of the PQC generally (though not strictly) improves the
performance of the agents. Note that the maximum depth we tested was D_enc = 10.
• Influence of scaling parameters λ: We observe that training these scaling parameters in general
benefits the learning performance of our PQC policies, likely due to their increased expressivity.
• Influence of trainable observable weights w: Our final consideration relates to the importance of
having a policy with “trainable greediness” in RL scenarios. For this, we consider SOFTMAX-PQC
agents with observables βO_a kept fixed throughout training. We observe that this has the general effect
of decreasing the performance and/or the speed of convergence of the agents. We also see that
policies with a fixed high β (or, equivalently, a large observable norm β‖O_a‖) tend to have poor
learning performance, likely due to their lack of exploration in the RL environments.
Finally, note that none of the numerical simulations performed here included any source of noise in
the PQC evaluations. It would be an interesting research direction to assess the influence of (simulated
or hardware-induced) noise on the learning performance of PQC agents.
SL-DLP Our objective is to show that analogous separations between classical and quantum
learners can be established for RL environments, in terms of their attainable value functions. We start
by pointing out that supervised learning (SL) tasks (and so the classification problem of Liu et al.) can
be trivially embedded into RL environments [50]: for a given concept fs , the states x are datapoints,
an action a is an agent’s guess on the label of x, an immediate reward specifies if it was correct
(i.e., fs (x) = a), and subsequent states are chosen uniformly at random. In such settings, the value
function is trivially related to the testing accuracy of the SL problem, yielding a direct reduction of
the separation result of Liu et al. [14] to an RL setting. We call this family of environments SL-DLP.
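A minimal sketch of this SL-to-RL embedding (the class and method names are our own; any labeling function and state set, e.g. ℤ*_p and f_s for SL-DLP, can be plugged in):

```python
import numpy as np

class SLEnv:
    """RL wrapper around a binary classification task: states are datapoints, the action
    is a label guess in {-1, +1}, the reward is +1/-1 for a correct/incorrect guess, and
    next states are drawn i.i.d. from the data distribution."""

    def __init__(self, states, labeling_fn, seed=None):
        self.states = list(states)          # input space of the classification task
        self.labeling_fn = labeling_fn      # ground-truth concept, e.g. f_s
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = self.rng.choice(self.states)
        return self.s

    def step(self, action):
        reward = 1.0 if action == self.labeling_fn(self.s) else -1.0
        self.s = self.rng.choice(self.states)   # next state is again sampled uniformly
        done = True                             # length-1 episodes, as in the reduction above
        return self.s, reward, done
```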
Cliffwalk-DLP In the SL-DLP construction, we made the environment fully random in order to
simulate the process of obtaining i.i.d. samples in an SL setting. It is an interesting question whether
similar results can be obtained for environments that are less random, and endowed with temporal
structure, which is characteristic of RL. In our second family of environments (Cliffwalk-DLP),
we supplement the SL-DLP construction with next-state transitions inspired by the textbook “cliff
walking” environment of Sutton & Barto [44]: all states are ordered in a chain and some actions of the
agent can lead to immediate episode termination. We nonetheless keep stochasticity in the environment
by allowing next states to be uniformly sampled with a certain probability δ (a common device in RL to
ensure that an agent is not simply memorizing a correct sequence of actions). This allows us to show
that, as long as sufficient randomness is provided, we still have a simple classical-quantum separation.
Deterministic-DLP In the two families constructed above, each environment instance provided
the randomness needed for a reduction from the SL problem. This brings us to the question of
whether separations are also possible for fully deterministic environments. In this case, it is clear
that for any given environment, there exists an efficient classical agent which performs perfectly
over any polynomial horizon (a lookup-table will do). However, we show in our third family of
environments (Deterministic-DLP) that a separation can still be attained by moving the randomness
to the choice of the environment itself: the assumption that an efficient classical agent is successful in most
of exponentially many randomly generated (but otherwise deterministic) environments implies the
existence of an efficient classical algorithm for DLP.
We summarize our results in the following theorem, detailed and proven in Appendices G through I.
Theorem 1. There exist families of reinforcement learning environments which are: i) fully random
(i.e., subsequent states are independent from the previous state and action); ii) partially random
(i.e., the previous moves determine subsequent states, except with a probability δ at least 0.86 where
they are chosen uniformly at random), and iii) fully deterministic; such that there exists a separation
in the value functions achievable by a given quantum polynomial-time agent and any classical
polynomial-time agent. Specifically, the value of the initial state for the quantum agent V_q(s_0) is
ε-close to the optimal value function (for a chosen ε, and with probability above 2/3). Further, if
there exists an efficient classical learning agent that achieves a value V_c(s_0) better than V_rand(s_0) + ε'
(for a chosen ε', and with probability above 0.845), then there exists an efficient classical algorithm
to solve DLP. Finally, we have V_q(s_0) − V_c(s_0) larger than some constant, which depends on the
details of the environment.
The remaining point we need to address here is that the learning agents of Liu et al. do not rely on
PQCs but rather support vector machines (SVMs) based on quantum kernels [6, 7]. Nonetheless,
using a connection between these quantum SVMs and PQCs [7], we construct PQC policies which
are as powerful in solving the DLP environments as the agents of Liu et al. (even under similar noise
considerations). We state our result in the following informal theorem, that we re-state formally,
along with the details of our construction in Appendices J and K.
Theorem 2 (informal version). Using a training set of size polynomial in n = log(p) and a number
of (noisy) quantum circuit evaluations also polynomial in n, we can train a PQC classifier on the
DLP task of Liu et al. of size n that achieves a testing accuracy arbitrarily close to optimal, with high
probability. This PQC classifier can in turn be used to construct close-to-optimal quantum agents in
our DLP environments, as prescribed by Theorem 1.
While the DLP environments establish a proof of the learning advantage PQC policies can have in
theory, these environments remain extremely contrived and artificial. They are based on algebraic
properties that agents must explicitly decrypt in order to perform well. Instead, we would like to
consider environments that are less tailored to a specific decryption function, which would allow
more general agents to learn. To do this, we take inspiration from the work of Havlíček et al. [6], who,
in order to test their PQC classifiers, define a learning task generated by similar quantum circuits.
[Figure 5 panels: (a) PQC labeling function, (b) SL-PQC, (c) Cliffwalk-PQC.]
Figure 5: Numerical evidence of the advantage of PQC policies over DNN policies in PQC-
generated environments. (a) Labeling function and training data used for both RL environments.
The data labels (red for +1 label and blue for −1 label) are generated using a RAW-PQC of depth
Denc = 4 with a margin ∆ = 0.3 (white areas). The training samples are uniformly sampled from
the blue and red regions, and arrows indicate the rewarded path of the cliffwalk environment. (b) and
(c) The learning curves (20 agents per curve) of randomly-initialized SOFTMAX -PQC agents and
DNN agents in RL environments where input states are (b) uniformly sampled from the dataset and
(c) follow cliffwalk dynamics. Each curve is temporally averaged with a time window of 10 episodes.
This dataset allows us to define two RL environments, similar to the SL-DLP and Cliffwalk-DLP
environments of Sec. 4.1:
5 Conclusion
In this work, we have investigated the design of quantum RL agents based on PQCs. We proposed
several constructions and showed the impact of certain design choices on learning performance. In
particular, we introduced the SOFTMAX -PQC model, where a softmax policy is computed from
expectation values of a PQC with both trainable observable weights and input scaling parameters.
These added features to standard PQCs used in ML (e.g., as quantum classifiers) enhance both the
expressivity and flexibility of PQC policies, which allows them to achieve a learning performance on
benchmarking environments comparable to that of standard DNNs. We additionally demonstrated the
existence of task environments, constructed out of PQCs, that are very natural for PQC agents but on
which DNN agents perform poorly. To strengthen this result, we constructed several RL
environments, each with a different degree of degeneracy (i.e., closeness to a supervised learning
task), where we showed a rigorous separation between a class of PQC agents and any classical
learner, based on the widely-believed classical hardness of the discrete logarithm problem. We
believe that our results constitute strides toward a practical quantum advantage in RL using near-term
quantum devices.
6 Broad impact
We expect our work to have an overall positive societal impact. Notably, we believe that our approach
to QRL could be beneficial in the two following ways:
At the same time, our work may have certain negative consequences. Notably, QRL will inherit
many of the problems that are already present in classical RL and ML in general. For instance, it
is not clear whether the question of interpretability of learning models [58] will be negatively or
positively impacted by switching to quantum models. One could argue that the inability to fully
access the quantum Hilbert spaces in which quantum computers operate can turn learning models
even further into “black-boxes” than existing classical models. Also, similarly to the fact that current
state-of-the-art ML/RL requires supercomputers that are not accessible to everyone, private and select
access to quantum computers could emphasize existing inequalities in developing and using AI.
Acknowledgments
The authors would like to thank Srinivasan Arunachalam for clarifications on the testing accuracy of
their quantum classifier in the DLP classification task. The authors would also like to thank Andrea
Skolik and Arjan Cornelissen for helpful discussions and comments. CG thanks Thomas Moerland
for discussions in the early phases of this project. SJ and HJB acknowledge support from the Austrian
Science Fund (FWF) through the projects DK-ALM:W1259-N27 and SFB BeyondC F7102. SJ
also acknowledges the Austrian Academy of Sciences as a recipient of the DOC Fellowship. This
work was in part supported by the Dutch Research Council (NWO/OCW), as part of the Quantum
Software Consortium program (project number 024.003.037). VD and SM acknowledge the support
by the project NEASQC funded from the European Union’s Horizon 2020 research and innovation
programme (grant agreement No 951821). VD and SM also acknowledge partial funding by an
unrestricted gift from Google Quantum AI. The computational results presented here have been
achieved in part using the LEO HPC infrastructure of the University of Innsbruck.
References
[1] John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018.
[2] Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw, Tobias Haug, Sumner Alperin-Lea, Abhinav
Anand, Matthias Degroote, Hermanni Heimonen, Jakob S Kottmann, Tim Menke, et al. Noisy
intermediate-scale quantum (NISQ) algorithms. arXiv preprint arXiv:2101.08448, 2021.
[3] Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini. Parameterized quantum
circuits as machine learning models. Quantum Science and Technology, 4(4):043001, 2019.
[4] Edward Farhi and Hartmut Neven. Classification with quantum neural networks on near term
processors. arXiv preprint arXiv:1802.06002, 2018.
[5] Maria Schuld, Alex Bocharov, Krysta M Svore, and Nathan Wiebe. Circuit-centric quantum
classifiers. Physical Review A, 101(3):032308, 2020.
[6] Vojtěch Havlíček, Antonio D Córcoles, Kristan Temme, Aram W Harrow, Abhinav Kandala,
Jerry M Chow, and Jay M Gambetta. Supervised learning with quantum-enhanced feature
spaces. Nature, 567(7747):209–212, 2019.
[7] Maria Schuld and Nathan Killoran. Quantum machine learning in feature Hilbert spaces.
Physical review letters, 122(4):040504, 2019.
[8] Evan Peters, Joao Caldeira, Alan Ho, Stefan Leichenauer, Masoud Mohseni, Hartmut Neven,
Panagiotis Spentzouris, Doug Strain, and Gabriel N Perdue. Machine learning of high dimen-
sional data on a noisy quantum processor. arXiv preprint arXiv:2101.09581, 2021.
[9] Jin-Guo Liu and Lei Wang. Differentiable learning of quantum circuit born machines. Physical
Review A, 98(6):062324, 2018.
[10] Daiwei Zhu, Norbert M Linke, Marcello Benedetti, Kevin A Landsman, Nhung H Nguyen,
C Huerta Alderete, Alejandro Perdomo-Ortiz, Nathan Korda, A Garfoot, Charles Brecque, et al.
Training of quantum circuits on a hybrid quantum computer. Science advances, 5(10):eaaw9918,
2019.
[11] JS Otterbach, R Manenti, N Alidoust, A Bestwick, M Block, B Bloom, S Caldwell, N Didier,
E Schuyler Fried, S Hong, et al. Unsupervised machine learning on a hybrid quantum computer.
arXiv preprint arXiv:1712.05771, 2017.
[12] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hart-
mut Neven, and Jarrod R McClean. Power of data in quantum machine learning. Nature
communications, 12(1):1–9, 2021.
[13] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and Dacheng Tao. Expressive power of
parametrized quantum circuits. Physical Review Research, 2(3):033125, 2020.
[14] Yunchao Liu, Srinivasan Arunachalam, and Kristan Temme. A rigorous and robust quantum
speed-up in supervised machine learning. Nature Physics, 17(9):1013–1017, 2021.
[15] Ryan Sweke, Jean-Pierre Seifert, Dominik Hangleiter, and Jens Eisert. On the quantum versus
classical learnability of discrete distributions. Quantum, 5:417, 2021.
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G
Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur
Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of
go without human knowledge. Nature, 550(7676):354, 2017.
[18] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ebiak, Christy
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large
scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[19] Piotr Mirowski, Matt Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson,
Denis Teplyashin, Karen Simonyan, Andrew Zisserman, Raia Hadsell, et al. Learning to
navigate in cities without a map. Advances in Neural Information Processing Systems, 31:
2419–2430, 2018.
[20] Samuel Yen-Chi Chen, Chao-Han Huck Yang, Jun Qi, Pin-Yu Chen, Xiaoli Ma, and Hsi-
Sheng Goan. Variational quantum circuits for deep reinforcement learning. IEEE Access, 8:
141007–141024, 2020.
[21] Owen Lockwood and Mei Si. Reinforcement learning with quantum variational circuit. In Pro-
ceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment,
volume 16, pages 245–251, 2020.
[22] Shaojun Wu, Shan Jin, Dingding Wen, and Xiaoting Wang. Quantum reinforcement learning in
continuous action space. arXiv preprint arXiv:2012.10711, 2020.
[23] Sofiene Jerbi, Lea M. Trenkwalder, Hendrik Poulsen Nautrup, Hans J. Briegel, and Vedran
Dunjko. Quantum enhancements for deep reinforcement learning in large spaces. PRX Quantum,
2:010328, Feb 2021.
[24] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[25] Peter W Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a
quantum computer. SIAM review, 41(2):303–332, 1999.
[26] Manuel Blum and Silvio Micali. How to generate cryptographically strong sequences of
pseudorandom bits. SIAM journal on Computing, 13(4):850–864, 1984.
[27] Andrea Skolik, Sofiene Jerbi, and Vedran Dunjko. Quantum agents in the gym: a variational
quantum algorithm for deep q-learning. arXiv preprint arXiv:2103.15084, 2021.
[28] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster, and José I Latorre. Data re-
uploading for a universal quantum classifier. Quantum, 4:226, 2020.
[29] Daoyi Dong, Chunlin Chen, Hanxiong Li, and Tzyh-Jong Tarn. Quantum reinforcement learning.
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(5):1207–1220,
2008.
[30] Giuseppe Davide Paparo, Vedran Dunjko, Adi Makmal, Miguel Angel Martin-Delgado, and
Hans J Briegel. Quantum speedup for active learning agents. Physical Review X, 4(3):031002,
2014.
[31] Vedran Dunjko, Jacob M Taylor, and Hans J Briegel. Quantum-enhanced machine learning.
Physical review letters, 117(13):130501, 2016.
[32] Daniel Crawford, Anna Levit, Navid Ghadermarzy, Jaspreet S Oberoi, and Pooya Ronagh. Rein-
forcement learning using quantum boltzmann machines. Quantum Information & Computation,
18(1-2):51–74, 2018.
[33] Florian Neukart, David Von Dollen, Christian Seidel, and Gabriele Compostella. Quantum-
enhanced reinforcement learning for finite-episode games with discrete state spaces. Frontiers
in Physics, 5:71, 2018.
[34] Lov K Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the
twenty-eighth annual ACM symposium on Theory of computing, pages 212–219, 1996.
[35] Mark W Johnson, Mohammad HS Amin, Suzanne Gildert, Trevor Lanting, Firas Hamze, Neil
Dickson, Richard Harris, Andrew J Berkley, Jan Johansson, Paul Bunyk, et al. Quantum
annealing with manufactured spins. Nature, 473(7346):194–198, 2011.
[36] TensorFlow Quantum. Parametrized quantum circuits for reinforcement learning.
URL : tensorflow.org/quantum/tutorials/quantum_reinforcement_learning, 2021.
[37] Michael Broughton, Guillaume Verdon, Trevor McCourt, Antonio J Martinez, Jae Hyeon Yoo,
Sergei V Isakov, Philip Massey, Murphy Yuezhen Niu, Ramin Halavati, Evan Peters, et al.
Tensorflow quantum: A software framework for quantum machine learning. arXiv preprint
arXiv:2003.02989, 2020.
[38] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information.
Cambridge University Press, 2000.
[39] Ronald De Wolf. Quantum computing: Lecture notes. arXiv preprint arXiv:1907.09415, 2019.
[40] Abhinav Kandala, Antonio Mezzacapo, Kristan Temme, Maika Takita, Markus Brink, Jerry M
Chow, and Jay M Gambetta. Hardware-efficient variational quantum eigensolver for small
molecules and quantum magnets. Nature, 549(7671):242–246, 2017.
[41] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. Effect of data encoding on the
expressive power of variational quantum-machine-learning models. Physical Review A, 103(3):
032430, 2021.
[42] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. Evaluating
analytic gradients on quantum hardware. Physical Review A, 99(3):032331, 2019.
[43] Lilian Weng. Policy gradient algorithms. URL : lilianweng.github.io/lil-log, 2018.
[44] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. 1998.
[45] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine learning, 8(3-4):229–256, 1992.
[46] Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa, and Keisuke Fujii. Quantum circuit
learning. Physical Review A, 98(3):032309, 2018.
[47] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for
gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):
1471–1530, 2004.
[48] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep
reinforcement learning for continuous control. In International conference on machine learning,
pages 1329–1338. PMLR, 2016.
[49] OpenAI. Leaderboard of openai gym environments. URL : github.com/openai/gym/wiki, 2020.
[50] Vedran Dunjko, Yi-Kai Liu, Xingyao Wu, and Jacob M Taylor. Exponential improvements for
quantum-accessible reinforcement learning. arXiv preprint arXiv:1710.11160, 2017.
[51] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak
Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using
a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
[52] Benjamin Villalonga, Dmitry Lyakh, Sergio Boixo, Hartmut Neven, Travis S Humble, Rupak
Biswas, Eleanor G Rieffel, Alan Ho, and Salvatore Mandrà. Establishing the quantum supremacy
frontier with a 281 pflop/s simulation. Quantum Science and Technology, 5(3):034003, 2020.
[53] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.
The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[54] Mufti Mahmud, Mohammed Shamim Kaiser, Amir Hussain, and Stefano Vassanelli. Appli-
cations of deep learning and reinforcement learning to biological data. IEEE transactions on
neural networks and learning systems, 29(6):2063–2079, 2018.
[55] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey.
arXiv preprint arXiv:1908.08796, 2019.
[56] Francisco Albarrán-Arriagada, Juan C Retamal, Enrique Solano, and Lucas Lamata.
Measurement-based adaptation protocol with quantum reinforcement learning. Physical Review
A, 98(4):042315, 2018.
[57] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J
Love, Alán Aspuru-Guzik, and Jeremy L O’brien. A variational eigenvalue solver on a photonic
quantum processor. Nature communications, 5(1):1–7, 2014.
[58] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review
of machine learning interpretability methods. Entropy, 23(1):18, 2021.
[59] Takahiro Goto, Quoc Hoan Tran, and Kohei Nakajima. Universal approximation property
of quantum machine learning models in quantum-enhanced feature spaces. Physical Review
Letters, 127(9):090506, 2021.
[60] Adrián Pérez-Salinas, David López-Núñez, Artur García-Sáez, Pol Forn-Díaz, and José I
Latorre. One qubit as a universal approximant. Physical Review A, 104(1):012405, 2021.
[61] Google. Cirq: A python framework for creating, editing, and invoking noisy intermediate scale
quantum circuits. URL : github.com/quantumlib/Cirq, 2018.
[62] Yasunari Suzuki, Yoshiaki Kawase, Yuya Masumura, Yuria Hiraga, Masahiro Nakadai, Jiabao
Chen, Ken M Nakanishi, Kosuke Mitarai, Ryosuke Imai, Shiro Tamiya, et al. Qulacs: a fast
and versatile quantum circuit simulator for research purpose. arXiv preprint arXiv:2011.13524,
2020.
Supplementary Material for:
Parametrized Quantum Policies for Reinforcement Learning
Sofiene Jerbi, Casper Gyurik, Simon C. Marshall, Hans J. Briegel, Vedran Dunjko
Due to the monotonicity of the exponential, we have, for all a:
    e^{−βε} e^{β⟨O_a⟩_{s,θ}} / (e^{βε} Σ_{a'} e^{β⟨O_{a'}⟩_{s,θ}}) ≤ e^{β⟨Õ_a⟩_{s,θ}} / Σ_{a'} e^{β⟨Õ_{a'}⟩_{s,θ}} ≤ e^{βε} e^{β⟨O_a⟩_{s,θ}} / (e^{−βε} Σ_{a'} e^{β⟨O_{a'}⟩_{s,θ}})
    ⇔ e^{−2βε} π_θ(a|s) ≤ π̃_θ(a|s) ≤ e^{2βε} π_θ(a|s).                             (7)
Hence,
    TV(π_θ, π̃_θ) = Σ_a |π̃_θ(a|s) − π_θ(a|s)|
                 ≤ Σ_a ( e^{2βε} π_θ(a|s) − e^{−2βε} π_θ(a|s) )
                 = (e^{2βε} − e^{−2βε}) Σ_a π_θ(a|s)
                 = 2|sinh(2βε)| = 4βε + O((βε)^3) as βε → 0^+.
Using a similar approach to the proof of the previous section, we show the following lemma:
Lemma 3. For a SOFTMAX-PQC policy π_θ defined by a unitary U(s, θ) and observables O_a, call
∂_i⟨Õ_a⟩_{s,θ} approximations of the true derivatives ∂_i⟨O_a⟩_{s,θ} with at most ε additive error, and ⟨Õ_a⟩_{s,θ}
approximations of the true expectation values ⟨O_a⟩_{s,θ} with at most ε' = ε (4β max_a ‖O_a‖)^{−1}
additive error. Then the approximate log-policy gradient ∇_θ log π̃_θ(a|s) = β ( ∇_θ ⟨Õ_a⟩_{s,θ} −
Σ_{a'} π̃_θ(a'|s) ∇_θ ⟨Õ_{a'}⟩_{s,θ} ) has distance O(βε) to ∇_θ log π_θ(a|s) in ℓ∞-norm.
Proof. Call x_{a,i} = π_θ(a|s) ∂_i⟨O_a⟩_{s,θ} and x̃_{a,i} = π̃_θ(a|s) ∂_i⟨Õ_a⟩_{s,θ}, such that:
    ∂_i log π̃_θ(a|s) = β ( ∂_i⟨Õ_a⟩_{s,θ} − Σ_{a'} x̃_{a',i} ).
We also have
    Σ_a x_{a,i} = Σ_a π_θ(a|s) ∂_i⟨O_a⟩_{s,θ} ≤ max_{a,i} |∂_i⟨O_a⟩_{s,θ}| ≤ max_a ‖O_a‖,
where the last inequality derives from the parameter-shift rule (Eq. (5)) formulation of ∂_i⟨O_a⟩ for
derivatives w.r.t. rotation angles of the PQC, and from the fact that the ∂_i⟨O_a⟩ are simply expectation
values ⟨H_{a,i}⟩ with ‖H_{a,i}‖ ≤ ‖O_a‖ for observable weights.
Applying the triangle inequality to the right-hand side of Eq. (8), we hence have:
    |Σ_a x_{a,i} − Σ_a x̃_{a,i}| ≤ ε + 4βε' max_a ‖O_a‖ + O((βε')^2 ε) + O((βε')^3)   as βε' → 0^+.
D Environment specifications and hyperparameters
In Table 1, we present a specification of the environments we consider in our numerical simulations.
These are standard benchmarking environments from the OpenAI Gym library [24], described in Ref.
[49], PQC-generated environments that we define in Sec. 4.2, and the CognitiveRadio environment
of Ref. [20] that we discuss in Appendix E.
Environment | State dimension | Number of actions | Horizon | Reward function | Termination conditions
MountainCar-v0 | 2 | 3 | 200 | −1 + height until termination | Reaching goal or horizon
Table 1: Environments specifications. The reward function of Mountaincar-v0 has been modified
compared to the standard specification of OpenAI Gym [24], similarly to Ref. [48].
In Tables 2 and 3, we list the hyperparameters used to train our agents on the various environments
we consider. All agents use an ADAM optimizer. For the plots presented in this manuscript, all
quantum circuits were implemented using the Cirq library [61] in Python and simulated using a
Qulacs backend [62] in C++. For the tutorial [36], the TensorFlow Quantum library [37] was used.
All simulations were run on the LEO cluster (more than 3000 CPUs) of the University of Innsbruck,
with an estimated total compute time (including hyperparametrization) of 20 000 CPU-hours.
In Fig. 6, we run a similar experiment to that of Sec. 3.2 in the main text, but on RAW-PQC agents
instead of SOFTMAX -PQC agents. We observe that both increasing the depth of the PQCs and
training the scaling parameters λ have a similar positive influence on the learning performance, one
even more pronounced than for SOFTMAX-PQC agents. Nonetheless, we also observe that, even
at greater depth, the final performance, as well as the speed of convergence, of RAW-PQC agents
remain limited compared to that of SOFTMAX -PQC agents.
[Figure 6 plots: average collected rewards vs. episode for RAW-PQC agents on CartPole, MountainCar and Acrobot; curves compare the reference depths (1, 4, 2), increased depths (5, 6, 5), and fixed scaling parameters λ.]
Figure 6: Influence of the model architecture for RAW-PQC agents. The blue curves in each plot
correspond to the learning curves from Fig. 3 and are taken as a reference.
Environment | Model | Learning rates | Depth | Width | Discount γ | Final β | Batch size
Table 2: Hyperparameters 1/2. For PQC policies, we choose 3 distinct learning rates [α_φ, α_w, α_λ]
for the rotation angles φ, observable weights w and scaling parameters λ, respectively. For SOFTMAX-
PQCs, we take a linear annealing schedule for the inverse-temperature parameter β, starting from 1
and ending at the final β. The batch size is counted in the number of episodes used to evaluate the
gradient of the value function. Depth indicates the number of encoding layers D_enc for PQC policies,
or the number of hidden layers for a DNN policy. Width corresponds to the number of qubits n on
which a PQC acts (also equal to the dimension d of the environment's state space), or the number of
units per hidden layer for a DNN.
Environment | Model | Observables | Number of params. | Entang. topology | Train entang. | Baseline
Table 3: Hyperparameters 2/2. We call an entangling layer a layer of 2-qubit gates in the PQC. Circular
and all-to-all topologies of entangling layers are equivalent for n = 2 qubits, so we call them one-
to-one in that case. When trained, entangling layers are composed of R_zz = e^{−iθ(Z⊗Z)/2} rotations;
otherwise, they are composed of Ctrl-Z gates. For policies with 2 actions, the same observable, up
to a sign change, is used for both actions. Z_i refers to a Pauli-Z observable acting on qubit i, while
P_{i..j} indicates a projection on basis states i to j. In the experiments of Sec. 3.2, when the weights
of the SOFTMAX-PQC are kept fixed, the observables used for MountainCar-v0 and Acrobot-v1
are [Z_0, Z_0 Z_1, Z_1], and those used for CartPole-v1 are [Z_0 Z_1 Z_2 Z_3, −Z_0 Z_1 Z_2 Z_3]. The different
numbers of parameters in a given row correspond to the different depths in that same row in Table 2.
E.2 Shape of the policies learned by PQCs vs. DNNs
In CartPole-v1 The results of Sec. 3 demonstrate that our PQC policies can be trained to good
performance in benchmarking environments. To get a feel for the solutions found by our agents, we
compare the SOFTMAX -PQC policies learned on CartPole to those learned by standard DNNs (with
a softmax output layer), which are known to easily learn close-to-optimal behavior on this task. More
specifically, we look at the functions learned by these two models, prior to the application of the
softmax normalization function (see Eq. (3)). Typical instances of these functions are depicted in
Figure 8. We observe that, while DNNs learn simple, close-to-piecewise-linear functions of their
input state space, PQCs naturally tend to learn highly oscillating functions that are more prone to
instability. While the results of Schuld et al. [42] already indicated that such highly oscillating
functions are natural for PQCs, it is noteworthy that these are also the type of functions
naturally learned in a direct-policy RL scenario. Moreover, our enhancements to standard PQC
classifiers show how to make these highly oscillating functions more amenable to real-world tasks.
In PQC-generated environments Fig. 9 shows results analogous to those of Fig. 5 in the main text, but
with two different random initializations of the environment-generating PQC. Both confirm our
observations. In Fig. 10, we compare the policies learned by prototypical SOFTMAX-PQC and DNN
agents in these PQC-generated environments. We observe that the typical policies learned by DNNs
are rather simple, with up to 2 (or 3) regions, delimited by close-to-linear boundaries, as opposed to
the policies learned by SOFTMAX -PQCs, which delimit red from blue regions with wide margins.
These observations highlight the inherent flexibility of SOFTMAX -PQC policies and their suitability
to these PQC-generated environments, as opposed to the DNN (and RAW-PQC) policies we consider.
In a related work on value-based RL with PQCs, the authors of Ref. [20] introduced the CognitiveRa-
dio environment as a benchmark to test their RL agents. In this environment, the agent is presented at
each interaction step with a binary vector (e.g., (0, 0, 0, 1, 0)) of size n that describes the occupation of n
radio channels. Given this state, the agent must select one of the n channels as its communication
channel, so as to avoid collisions with occupied channels (a ±1 reward reflects these collisions). The
authors of Ref. [20] consider a setting where, in any given state, only one channel is occupied and
its assignment changes periodically over time steps, for an episode length of 100 steps. While this
constitutes a fairly simple task environment with discrete state and action spaces, it allows testing the
performance of PQC agents on a family of environments described by their system size n and making
claims on the parameter complexity of the PQCs as a function of n. To reproduce the findings of
Ref. [20] in a policy-gradient setting, we test the performance of our SOFTMAX-PQC agents on this
environment. We find numerically (see Fig. 7) that they achieve a performance very similar to that of
the PQC agents of Ref. [20] on the same system sizes considered there (n = 2 to 5), using PQCs with
the same scaling of the number of parameters, i.e., O(n).
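For reference, a minimal sketch of such an environment as we understand it from Ref. [20]; the exact periodic assignment schedule and the class interface are assumptions made for illustration.

```python
import numpy as np

class CognitiveRadioEnv:
    """At each step exactly one of n channels is occupied; the agent picks a channel and
    receives +1 if it is free and -1 on a collision; the occupied channel shifts
    periodically and episodes last 100 steps."""

    def __init__(self, n_channels=4, horizon=100):
        self.n, self.horizon = n_channels, horizon

    def _state(self):
        s = np.zeros(self.n)
        s[self.occupied] = 1.0          # binary occupation vector shown to the agent
        return s

    def reset(self):
        self.t, self.occupied = 0, 0
        return self._state()

    def step(self, action):
        reward = -1.0 if action == self.occupied else 1.0
        self.t += 1
        self.occupied = self.t % self.n   # assumed cyclic re-assignment of the occupied channel
        done = self.t >= self.horizon
        return self._state(), reward, done
```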
[Figure 7 plots: average collected rewards for n = 2, 3, 4 and 5 channels.]
Figure 7: Performance of our SOFTMAX -PQC agents on the CognitiveRadio environment pro-
posed in Ref. [20]. Average performance of 20 agents for system sizes (and number of qubits) n = 2
to 5.
[Figure 8 plots, panels (a) and (b): 3D surfaces of unnormalized probability as functions of pole angle, cart velocity and pole angular velocity.]
Figure 8: Prototypical unnormalized policies learned by SOFTMAX -PQC agents and DNN
agents in CartPole. Due to the 4 dimensions of the state space in CartPole, we represent the
unnormalized policies learned by (a) SOFTMAX -PQC agents and (b) DNN agents on 3 subspaces of
the state space by fixing unrepresented dimensions to 0 in each plot. To get the probability of the
agent pushing the cart to the left, one should apply the logistic function (i.e., 2-dimensional softmax)
1/(1 + exp(−z)) to the z-axis values of each plot.
[Figure 9 panels: (a) PQC labeling function, (b) SL-PQC, (c) Cliffwalk-PQC.]
[Figure 10 panels (a)–(d): learned policies over the 2D input space, with axes ranging over [0, 2π] and [0, π].]
Figure 10: Prototypical policies learned by SOFTMAX -PQC agents and DNN agents in PQC-
generated environments. All policies are associated to the labeling function of Fig. 9.d. Policies
(a) and (b) are learned in the SL-PQC environment while policies (c) and (d) are learned in the
Cliffwalk-PQC environment.
F Supervised learning task of Liu et al.
Define p a large prime number, n = ⌈log₂(p − 1)⌉, and g a generator of ℤ*_p = {1, 2, . . . , p − 1} (i.e.,
a g ∈ ℤ*_p such that {g^y, y ∈ ℤ_{p−1}} = ℤ*_p). The DLP consists in computing log_g x on input x ∈ ℤ*_p.
Based on DLP, Liu et al. [14] define a concept class C = {f_s}_{s ∈ ℤ_{p−1}} over the input space X = ℤ*_p,
where each labeling function of this concept class is defined as follows:
    f_s(x) = +1, if log_g x ∈ [s, s + (p−3)/2],
             −1, otherwise.                                                          (9)
Each function f_s : ℤ*_p → {−1, 1} hence labels half of the elements in ℤ*_p with the label +1 and the other
half with the label −1. We refer to Figure 1 in Ref. [14] for a good visualization of all these objects.
The performance of a classifier f is measured in terms of its testing accuracy
    Acc_f(f_s) = Pr_{x∼X}[f(x) = f_s(x)].
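For illustration, a short Python sketch (our own, not taken from Ref. [14]) that samples labeled examples from this concept class by drawing the exponent y directly, so that log_g x = y is known to the data generator; the interval is taken modulo p − 1, which is what makes exactly half of the inputs +1.

```python
import random

def dlp_dataset(p, g, s, num_samples, rng=random):
    """Sample (x, f_s(x)) pairs: draw y uniformly from Z_{p-1}, set x = g^y mod p,
    and label +1 iff y = log_g x lies in [s, s + (p-3)/2] (indices modulo p - 1)."""
    half_width = (p - 3) // 2
    data = []
    for _ in range(num_samples):
        y = rng.randrange(p - 1)
        x = pow(g, y, p)
        label = 1 if (y - s) % (p - 1) <= half_width else -1
        data.append((x, label))
    return data

# toy example (far from cryptographic sizes): p = 11 with generator g = 2
samples = dlp_dataset(p=11, g=2, s=3, num_samples=5)
```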
G Proof of Theorem 1
In the following, we provide constructions of a) fully random, b) partially random and c) fully
deterministic environments satisfying the properties of Theorem 1. We consider the three families of
environments separately and provide individual lemmas specifying their exact separation properties.
Fully random: the SL-DLP environment. This result is near-trivially obtained by noting that any
classification problem can be easily mapped to a (degenerate) RL problem. For this, the environment
will be an MDP defined as follows: its state space is the input space of the classification problem,
its action space comprises all possible labels, rewards are trivially +1 for assigning a correct label
to an input state and −1 otherwise, and the initial and next-state transition probabilities are state-
independent and equal to the input distribution of the classification task. The optimal policy of this
MDP is clearly the optimal classifier of the corresponding SL task. Consider now the classification
task of Liu et al., defined in detail in Appendix F: the input distribution is taken to be uniform on the
state space, i.e., P(s_t) = 1/|S|, and the performance of a classifier f w.r.t. a labeling (or ground truth)
function f* is measured in terms of a testing accuracy
    Acc_f(f*) = (1/|S|) Σ_s Pr[f(s) = f*(s)].                                         (10)
For the MDP associated to this classification task and length-1 episodes of interaction, the value
function of any policy π(a|s) is given by
    V_π(s_0) = (1/|S|) Σ_{s_0} ( π(f*(s_0)|s_0) − π(−f*(s_0)|s_0) )
             = (1/|S|) Σ_{s_0} ( 2π(f*(s_0)|s_0) − 1 )
             = 2 Acc_π(f*) − 1,
which is trivially related to the testing accuracy of this policy on the classification task. Note that we
also have Vrand (s0 ) = 0 and Vopt (s0 ) = 1.
Since these observations hold irrespective of the labeling function f*, we can show the following
result:
Lemma 4 (Quantum advantage in SL-DLP). There exists a uniform family of SL-DLP MDPs, each
derived from a labeling function f ∗ of the DLP concept class C (see Appendix F), for which classical
hardness and quantum learnability holds. More specifically, the performance of any classical learner
is upper bounded by 1/poly(n), while that of a class of quantum agents is lower bounded by 0.98
with probability above 2/3 (over the randomness of their interaction with the environment and noise
in their implementation).
testing accuracy Accπ (f ∗ ) ≤ 1/2 + 1/poly(n), and hence its value function would be Vπ (s0 ) ≤
1/poly(n).
For quantum learnability, we define an agent that first collects poly(n) random length-1 interactions
(i.e., a random state s_0 and its associated reward for the action +1, from which the label f*(s_0) can be
inferred), and uses Theorem 2 of Liu et al. to train a classifier that has test accuracy at least 0.99 with
probability at least 2/3 (this process can be repeated O(log δ^{−1}) times to increase this probability
to 1 − δ via majority voting). This classifier has a value function V_π(s_0) ≥ 0.98.
Note that this proof trivially generalizes to episodes of interaction of length greater than 1, when
preserving the absence of temporal correlations in the states experienced by the agents. For episodes of
length H, the only change is that the value function of any policy, and hence the bounds we achieve,
get multiplied by a factor of (1 − γ^H)/(1 − γ) for a discount factor γ < 1, and by a factor H for γ = 1.
Partially random: the Cliffwalk-DLP environment. One major criticism to the result of Lemma
4 is that it applies to a very degenerate, fully random RL environment. In the following, we show
that similar results can be obtained in environments based on the same classification problem, but
while imposing more temporal structure and less randomness (such constructions were introduced
in Ref. [50], but for the purpose of query separations between RL and QRL). For instance, one can
consider cliffwalk-type environments, inspired by the textbook “cliff walking” environment of Sutton
& Barto [44]. This class of environments differs from the previous SL-DLP environments in its state
and reward structure: in any episode of interaction, experienced states follow a fixed “path” structure
(that of the cliff) for correct actions, and a wrong action leads to immediate “death” (negative reward
and episode termination). We slightly modify this environment into a “slippery scenario” in which,
with a δ probability, any action may lead to a uniformly random position on the cliff. This additional
randomness allows us to prove the following separation:
Lemma 5 (Quantum advantage in Cliffwalk-DLP). There exists a uniform family of Cliffwalk-DLP MDPs with arbitrary slipping probability δ ∈ [0.86, 1] and discount factor γ ∈ [0, 0.9], each derived from a labeling function f* of the DLP concept class C, for which classical hardness and quantum learnability hold. More specifically, the performance of any classical learner is upper bounded by V_rand(s_0) + 0.1, while that of a class of quantum agents is lower bounded by V_opt(s_0) − 0.1 with probability above 2/3 (over the randomness of their interaction with the environment and noise in their implementation). Since V_rand(s_0) ≤ −1/2 and V_opt = 0, we always have a classical-quantum separation.
of this policy in any episode of length k + 1 with an initial state s0 ∈ {x0 , l}. Since the testing state
xk is the only state to be rewarded, we can already note that Vπ (x0 ) = π(f ∗ (xk )|T, xk ), that is, the
probability of the policy correctly labeling the testing state xk after having experienced the training
states T . Also, since s0 ∈ {x0 , l} and Vπ (l) = 0, we have Vπ (x0 ) ≥ Vπ (s0 ).
With this construction, we obtain the following result:
Lemma 6 (Quantum advantage in Deterministic-DLP). There exists a uniform family of
Deterministic-DLP POMDPs (exponentially many instances for a given concept fs of the DLP
classification problem) where:
1) (classical hardness) if there exists a classical learning agent which, when placed in a randomly
chosen instance of the environment, has value Vc (s0 ) ≥ 1/2 + 1/poly(n) (that is, 1/poly(n) better
than a random agent), with probability at least 0.845 over the choice of environment and the randomness of its learning algorithm, then there exists an efficient classical algorithm to solve DLP,
2) (quantum learnability) there exists a class of quantum agents that attains a value Vq (s0 ) = 1 (that
is, the optimal value) with probability at least 0.98 over the choice of environment and randomness
of the learning algorithm.
• SL-DLP and Deterministic-DLP are the two closest environments to the DLP classification task of Liu et al. While the value function in SL-DLP is trivially equivalent to the accuracy on the classification problem, we find the value function in Deterministic-DLP to be a weaker figure of merit than this accuracy. Namely, a high accuracy trivially leads to a high value, while a high (or non-trivial) value does not necessarily lead to a high (or non-trivial) accuracy (in all these statements, high probability over the randomness of choosing the environment and of the learning algorithm is implied). This explains why the classical-hardness statement for Deterministic-DLP is weaker than in SL-DLP.
• In Cliffwalk-DLP, it is less straightforward to relate the testing accuracy of a policy to its performance on the deterministic parts of the environment, which explains why we trivially upper bound this performance by 0 on these parts. We believe, however, that these deterministic parts actually make the learning task much harder, since they strongly restrict the part of the state space the agents get to see. This claim is supported by our numerical experiments in Sec. 4.2. Also, since we showed classical hardness for fully deterministic environments, it would be simple to construct a variant of Cliffwalk-DLP where these deterministic parts are provably hard as well.
H Proof of Lemma 5
Consider a slippery cliffwalk environment defined by a labeling function f* in the concept class C of Liu et al. This cliffwalk has p − 1 states, ordered w.l.o.g. in their natural order, and correct actions (the ones that do not lead to immediate "death") f*(i) for each state i ∈ Z*_p. For simplicity of our proofs, we also consider circular boundary conditions (i.e., doing the correct action in the state p − 1 of the cliff leads to the state 1), random slipping at each interaction step to a uniformly sampled state on the cliff with probability δ > 0, an initialization of each episode in a uniformly sampled state i ∈ Z*_p, and a 0 (resp. −1) reward for doing the correct (resp. wrong) action in any given state.
The value function of any policy π, which has probability π(i) (we abbreviate π(f*(i)|i) to π(i)) of doing the correct action in state i ∈ Z*_p, is given by:
$$V_\pi(i) = \pi(i)\,\gamma\left((1-\delta)V_\pi(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j)\right) - (1-\pi(i)) \quad (11)$$
Since this environment only has negative rewards, we have that Vπ (i) ≤ 0 for any state i and policy
π, which allows us to write the following inequality:
$$V_\pi(i) \le \pi(i)\,\gamma\,\frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j) - (1-\pi(i)).$$
Averaging over the (uniformly distributed) states i ∈ Z*_p then gives
$$\frac{1}{p-1}\sum_{i=1}^{p-1}V_\pi(i) \le \left(\frac{1}{p-1}\sum_{i=1}^{p-1}\pi(i)\right)\left(\frac{\gamma\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j) + 1\right) - 1.$$
We note that the first factor is exactly the accuracy of the policy π on the classification task of Liu et al.:
$$\mathrm{Acc}_\pi(f^*) = \frac{1}{p-1}\sum_{i=1}^{p-1}\pi(i).$$
We hence have:
$$\frac{1}{p-1}\sum_{i=1}^{p-1}V_\pi(i) \le \mathrm{Acc}_\pi(f^*)\left(\frac{\gamma\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j) + 1\right) - 1.$$
Again, by noting in Eq. (11) that we have V_π(i) ≤ 0 and π(i) ≤ 1 for any policy π and state i ∈ Z*_p, we have:
$$V_\pi(i) \ge \gamma\left((1-\delta)V_\pi(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j)\right) - (1-\pi(i)).$$
We use this inequality to bound the value function at the initial state s_0:
$$V_\pi(s_0) = \frac{1}{p-1}\sum_{i=1}^{p-1}V_\pi(i) \ge \gamma\left(\frac{1-\delta}{p-1}\sum_{i=1}^{p-1}V_\pi(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j)\right) + \frac{1}{p-1}\sum_{i=1}^{p-1}\pi(i) - 1.$$
H.3 Bounds for classical hardness and quantum learnability
We use the bounds derived in the two previous sections to prove classical hardness and quantum
learnability of this task environment, as stated in Lemma 5.
For this, we start by noting the following expression for the value function of a random policy (one that takes uniformly random actions in all states):
$$V_{\mathrm{rand}}(s_0) = \frac{\gamma}{2}\left(\frac{1-\delta}{p-1}\sum_{i=1}^{p-1}V_{\mathrm{rand}}(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_{\mathrm{rand}}(j)\right) - \frac{1}{2} = \frac{\gamma}{2}V_{\mathrm{rand}}(s_0) - \frac{1}{2} = -\frac{1}{2-\gamma},$$
again due to the circular boundary conditions of the cliffwalk and the resulting absence of termination
conditions outside of “death".
As for the value function of the optimal policy, this is trivially Vopt = 0.
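As a numerical cross-check of this closed form (ours, for an arbitrary toy cliff size and parameters), fixed-point iteration of the Bellman equation for the random policy on the circular slippery cliffwalk converges to −1/(2 − γ):

```python
import numpy as np

p, delta, gamma = 102, 0.9, 0.8           # toy size and parameters (illustrative)
n_states = p - 1                          # cliff states 1, ..., p-1

V = np.zeros(n_states)
for _ in range(10_000):                   # fixed-point iteration of the Bellman equation
    mean_V = V.mean()
    V_next = np.roll(V, -1)               # circular "next state on the cliff"
    # Random policy: correct action w.p. 1/2 (reward 0, move on), wrong w.p. 1/2 (reward -1, terminate).
    V = 0.5 * gamma * ((1 - delta) * V_next + delta * mean_V) - 0.5

print(V.mean())           # ~ -1 / (2 - gamma)
print(-1 / (2 - gamma))   # closed form derived above
```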
Given that g(0.51, 0.86, 0.9) ≈ 0.0995 < 0.1, we then get our desired result for δ0 = 0.86 and
γ1 = 0.9. Noting that Vπ (s0 ) − Vrand (γ) ≤ g(x, δ, γ) ≤ 0.1 from Eq. (12), we hence have classical
hardness ∀ (δ, γ) ∈ [δ0 , 1] × [0, γ1 ].
I Proof of Lemma 6
I.1 Proof of classical hardness
≥ (1 − δ)(1 − ε),
since Pr[f_c(x_k) = f*(x_k) | T, x_k] ≥ 1 − ε for instances in S_δ and Σ_{T,x_k} P(T, x_k) Pr[success | T, x_k] ≥ 1 − δ by definition.
In the following, we set 1 − ε = 0.999 and 1 − δ ≥ 0.845 (the reason for this becomes apparent below), such that:
$$\mathbb{E}_{P(T)}\!\left[\mathrm{Acc}_{f_c}(T)\right] \ge 0.844155 > \frac{5}{6} + \frac{1}{96}. \quad (15)$$
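For the reader's convenience, the arithmetic behind the two sides of Eq. (15) is:
$$(1-\delta)(1-\varepsilon) \ge 0.845 \times 0.999 = 0.844155, \qquad \frac{5}{6} + \frac{1}{96} = \frac{80}{96} + \frac{1}{96} = \frac{81}{96} = 0.84375 < 0.844155.$$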
Now, consider the following learning algorithm: given a training set T, construct a Deterministic-DLP environment that uses this T and a randomly chosen x_k, and define the classifier f_c that boosts the π(a|T, x_k) obtained by running our classical agent on this environment (as explained above). We want to show that f_c has accuracy Acc_{f_c}(T) ≥ 1/2 + 1/poly(n) with probability at least 2/3 over the choice of T and the randomness of its construction, which is sufficient to solve DLP classically.
For that, we show a stronger statement. Call 𝒯_succ the subset of all instances of training sets 𝒯 = {T} for which Acc_{f_c}(T) ≥ 1/2 + 1/poly(n). We prove by contradiction that |𝒯_succ| ≥ 2|𝒯|/3.
Assume |𝒯_succ| < 2|𝒯|/3; then
$$\mathbb{E}_{P(T)}\!\left[\mathrm{Acc}_{f_c}(T)\right] = \sum_{T} P(T)\,\mathrm{Acc}_{f_c}(T) = \frac{1}{|\mathcal{T}|}\left(\sum_{T\in\mathcal{T}_{\mathrm{succ}}}\mathrm{Acc}_{f_c}(T) + \sum_{T\notin\mathcal{T}_{\mathrm{succ}}}\mathrm{Acc}_{f_c}(T)\right)$$
$$< \frac{|\mathcal{T}_{\mathrm{succ}}|}{|\mathcal{T}|}\times 1 + \frac{|\mathcal{T}|-|\mathcal{T}_{\mathrm{succ}}|}{|\mathcal{T}|}\left(\frac{1}{2}+\frac{1}{\mathrm{poly}(n)}\right) < \frac{5}{6} + \frac{1}{3\,\mathrm{poly}(n)} < 0.844155$$
for large enough n, in contradiction with Eq. (15).
Hence, with probability at least 2/3 over the choice of training states and the randomness of the learning algorithm, our constructed classifier has accuracy Acc_{f_c}(T) ≥ 1/2 + 1/poly(n). By Theorem 8, Remark 1 of Liu et al., this is sufficient to construct an efficient classical algorithm that solves DLP.
I.2 Proof of quantum learnability
Using the learning algorithm of Liu et al., we can construct a quantum classifier that achieves accuracy Acc_q(T) ≥ 0.99 with probability at least 2/3 over the randomness of the learning algorithm and the choice of training states T, of length |T| = poly(n). Now define instead training states T of length |T| = M · poly(n), for M = O(log δ′⁻¹) (hence |T| is still polynomial in n), and use each of the M segments of T to train M independent quantum classifiers. Define f_q as a classifier that labels x_k using a majority vote on the labels assigned by each of these classifiers. This constructed classifier has accuracy Acc_{f_q}(T) ≥ 0.99 with now probability 1 − δ′ over the choice of training states T and the randomness of the learning algorithm.
We then note that, by calling "success" the event Acc_{f_q}(T) ≥ 0.99, we have:
$$\sum_{T,x_k} P(T,x_k)\Pr\!\left[V_q(x_0)=1 \mid T,x_k\right] \ge \sum_{T} P(T)\sum_{x_k}P(x_k)\Pr[\text{success}\mid T]\times\Pr\!\left[V_q(x_0)=1\mid T,x_k,\text{success}\right]$$
$$= \sum_{T} P(T)\Pr[\text{success}\mid T]\sum_{x_k}P(x_k)\times\Pr\!\left[f_q(x_k)=f^*(x_k)\mid T,x_k,\text{success}\right] = \sum_{T}P(T)\Pr[\text{success}\mid T]\,\mathrm{Acc}_{f_q}(T) \ge (1-\delta')\times 0.99,$$
which means that our constructed agent achieves a value V_q(x_0) = 1 (which also implies V_q(s_0) = 1) with probability at least (1 − δ′) × 0.99 over the choice of environment and the randomness of the learning algorithm. By setting (1 − δ′) = 0.98/0.99, we get our statement.
To understand the distinction between the quantum learners of Liu et al. and the PQC policies we are
constructing here, we remind the reader of the two models for quantum SVMs defined in Ref. [7]:
the explicit and the implicit model. Both models share a feature-encoding unitary U(x) that encodes data points x into feature states |φ(x)⟩ = U(x)|0^⊗n⟩.
In the implicit model, one first evaluates the kernel values
$$K(x_i, x_j) = |\langle\phi(x_i)|\phi(x_j)\rangle|^2 \quad (16)$$
for the feature states associated to every pair of data points {x_i, x_j} in the dataset, and then uses the resulting kernel matrix in a classical SVM algorithm. This algorithm returns a hyperplane classifier in feature space, defined by its normal vector ⟨w| = Σ_i α_i ⟨φ(x_i)| and bias b, such that the sign of |⟨w|φ(x)⟩|² + b gives the label of x.
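To illustrate the classical post-processing step of the implicit model, the following sketch (ours; random toy feature vectors stand in for quantum feature states, and the labels are arbitrary) feeds a kernel matrix of the form of Eq. (16) to a standard SVM solver with a precomputed kernel.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for quantum feature states: random complex unit vectors, one per data point.
def random_states(num, dim):
    v = rng.normal(size=(num, dim)) + 1j * rng.normal(size=(num, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

states = random_states(40, 16)
labels = np.array([1] * 20 + [-1] * 20)

# Kernel entries K(x_i, x_j) = |<phi(x_i)|phi(x_j)>|^2, as in Eq. (16);
# on a quantum device these would be estimated from measurement statistics.
K = np.abs(states.conj() @ states.T) ** 2

clf = SVC(kernel="precomputed").fit(K, labels)  # classical SVM on the precomputed quantum kernel
print(clf.predict(K[:5]))                       # kernel rows between test and training points
```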
In the explicit model, the classifier is instead obtained by training a parametrized |w_θ⟩. Effectively, this classifier is implemented by applying a variational unitary V(θ) to the feature states |φ(x)⟩ and measuring the resulting quantum states with a fixed observable, whose expectation value is |⟨w_θ|φ(x)⟩|².
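A matching linear-algebra sketch of the explicit model (ours; random unitaries stand in for U(x) and V(θ), and the |0⟩⟨0| observable anticipates the choice made below) shows how the decision value arises as an overlap:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16                                    # toy Hilbert-space dimension (4 qubits)

def random_unitary(d):
    """Haar-ish random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

U_x = random_unitary(dim)       # stand-in for the feature-encoding unitary U(x)
V_theta = random_unitary(dim)   # stand-in for the variational circuit V(theta)

zero = np.zeros(dim, dtype=complex); zero[0] = 1.0   # |0...0>

phi_x = U_x @ zero              # feature state |phi(x)>
out = V_theta @ phi_x           # state after the variational circuit

# Expectation of the fixed observable O = |0...0><0...0| on the output state:
# this equals |<0|V(theta)U(x)|0>|^2, i.e. the overlap |<w_theta|phi(x)>|^2 with <w_theta| = <0|V(theta).
expectation = np.abs(zero.conj() @ out) ** 2
print(expectation)
```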
In the following sections, we describe how the implicit quantum SVMs of Liu et al. can be transformed
into explicit models while guaranteeing that they can still represent all possible optimal policies in
the DLP environments. And in Appendix K, we show that, even under similar noise considerations as
Liu et al., these optimal policies can also be found using poly(n) random data samples.
As we just described, our classifier belongs to a family of so-called explicit quantum SVMs. It is hence described by a PQC with two parts: a feature-encoding unitary U(x), which creates the feature state |φ(x)⟩ = U(x)|0^⊗n⟩ when applied to an all-zero state, followed by a variational circuit V(θ) parametrized by a vector θ. The resulting quantum state is then used to measure the expectation value ⟨O⟩_{x,θ} of an observable O, to be defined. We rely on the same feature-encoding unitary U(x) as the one used by Liu et al., i.e., the unitary that creates feature states of the form
$$|\phi(x)\rangle = \frac{1}{\sqrt{2^k}}\sum_{i=0}^{2^k-1}\left|x\cdot g^i\right\rangle \quad (17)$$
for k = n − t log(n), where t is a constant defined later, under noise considerations. This feature state can be seen as the uniform superposition of the image (under the exponentiation s′ ↦ g^{s′}) of an interval of integers [log_g(x), log_g(x) + 2^k − 1] in log-space. Note that U(x) can be implemented in Õ(n³) operations [14].
By noting that every labeling function f_s ∈ C to be learned (see Eq. (9)) delimits two equally-sized intervals of log(Z*_p), we can restrict the decision boundaries to be learned by our classifier to be half-space-dividing hyperplanes in log-space. In feature space, this is equivalent to learning separating hyperplanes that are normal to quantum states of the form:
$$|\phi_{s'}\rangle = \frac{1}{\sqrt{(p-1)/2}}\sum_{i=0}^{(p-3)/2}\left|g^{s'+i}\right\rangle. \quad (18)$$
Noticeably, for input points x such that log_g(x) is away from certain delimiting regions around s and s + (p−3)/2, the inner product |⟨φ(x)|φ_s⟩|² is either ∆ = 2^{k+1}/(p−1) or 0, whenever x is labeled +1 or −1 by f_s, respectively. This hence leads to a natural classifier, assuming overlaps of the form |⟨φ(x)|φ_{s'}⟩|² can be measured:
$$h_{s'}(x) = \begin{cases} 1, & \text{if } |\langle\phi(x)|\phi_{s'}\rangle|^2/\Delta \ge 1/2,\\ -1, & \text{otherwise,}\end{cases} \quad (19)$$
which has an (ideal) accuracy 1 − ∆ whenever s′ = s.
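The two-valued structure of these overlaps, and the classifier of Eq. (19), can be checked directly on a toy instance; the sketch below (ours; p = 23 and g = 5 are an arbitrary small prime and generator, and the states are written out as vectors rather than prepared by circuits) builds |φ(x)⟩ and |φ_{s'}⟩ as in Eqs. (17)–(18).

```python
import numpy as np

p, g = 23, 5                     # toy prime and generator (5 is a primitive root mod 23)
n_states = p - 1                 # elements of Z_p^* are 1, ..., p-1

# Discrete logs: dlog[x] = log_g(x) for x in Z_p^*
dlog = {pow(g, e, p): e for e in range(n_states)}

def phi_x(x, k):
    """Feature state of Eq. (17): uniform superposition over x*g^0, ..., x*g^(2^k - 1)."""
    v = np.zeros(n_states)
    for i in range(2 ** k):
        v[(x * pow(g, i, p)) % p - 1] = 1.0
    return v / np.sqrt(2 ** k)

def phi_s(s):
    """Hyperplane state of Eq. (18): uniform superposition over g^s, ..., g^(s + (p-3)/2)."""
    v = np.zeros(n_states)
    for i in range((p - 1) // 2):
        v[pow(g, s + i, p) - 1] = 1.0
    return v / np.sqrt((p - 1) / 2)

k, s = 2, 7
Delta = 2 ** (k + 1) / (p - 1)
for x in range(1, p):
    overlap = np.dot(phi_x(x, k), phi_s(s)) ** 2
    label = 1 if overlap / Delta >= 0.5 else -1          # classifier h_{s'} of Eq. (19)
    print(x, dlog[x], round(overlap, 3), label)          # overlap is Delta or 0 away from the boundary
```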
To complete the construction of our PQC classifier, we should hence design the composition of its variational part V(θ) and measurement O such that they result in expectation values of the form ⟨O⟩_{x,θ} = |⟨φ(x)|φ_{s′}⟩|². To do this, we note that, for |φ_{s′}⟩ = V̂(s′)|0⟩, the following equality holds:
$$|\langle\phi(x)|\phi_{s'}\rangle|^2 = \left|\left\langle 0^{\otimes n}\right|\hat{V}(s')^\dagger U(x)\left|0^{\otimes n}\right\rangle\right|^2 = \mathrm{Tr}\!\left[\,\left|0^{\otimes n}\right\rangle\!\left\langle 0^{\otimes n}\right|\,\rho(x,s')\right]$$
where ρ(x, s′) = |ψ(x, s′)⟩⟨ψ(x, s′)| is the density matrix of the quantum state |ψ(x, s′)⟩ = V̂(s′)†U(x)|0^⊗n⟩. Hence, an obvious choice of variational circuit is V(θ) = V̂(s′), combined with a measurement operator O = |0^⊗n⟩⟨0^⊗n|. Due to the similar nature of |φ_{s′}⟩ and |φ(x)⟩, it is possible to use an implementation for V̂(s′) that is similar to that of U(x) (take x = g^{s′} and k ≈ n/2).³ We also note that, for points x such that log_g(x) is (p − 1)∆/2 away from the boundary regions of h_{s′}, the non-zero inner products |⟨φ(x)|φ_{s′}⟩|² are equal to ∆ = O(n^{−t}). These can hence be estimated efficiently to additive error, which allows us to efficiently implement our classifier h_{s′} (Eq. (19)).
The noise has the effect that some points which would be correctly classified by the noiseless classifier now have a non-zero probability of being misclassified. To limit the overall decrease in classification accuracy, we focus on limiting the probability of misclassifying points x_i such that log_g(x_i) is (p − 1)∆/2 away from the boundary points s′ and s′ + (p−3)/2 of h_{s′}. We call I_{s′} the subset of Z*_p comprised of these points. For points in I_{s′}, the probability of misclassification is that of having |e_{i,s′}| ≥ ∆/2. We can use Chebyshev's inequality to bound this probability:
$$\Pr\!\left[|e_{i,s'}| \ge \frac{\Delta}{2}\right] \le \frac{4}{\Delta^2 R} \quad (20)$$
since E[e_{i,s′}] = 0 and Var[e_{i,s′}] ≤ 1/R.
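A quick simulation (ours; the overlap value, shot count and number of trials are arbitrary) of the R-shot frequency estimate behind Eq. (20) shows the empirical tail sitting well below the Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(1)
Delta, R, trials = 0.04, 10_000, 20_000    # arbitrary toy values

# e is the deviation of the R-shot frequency estimate from its mean Delta.
estimates = rng.binomial(R, Delta, size=trials) / R
e = estimates - Delta

empirical = np.mean(np.abs(e) >= Delta / 2)
chebyshev = 4 / (Delta ** 2 * R)           # right-hand side of Eq. (20), using Var <= 1/R
print(empirical, chebyshev)                # empirical tail is far below the bound
```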
The proof of this Theorem is detailed below.
Given a training set X ⊂ Z*_p, polynomially large in n (i.e., |X| = n^c), define the training loss
$$L(s') = \frac{1}{2|X|}\sum_{x\in X}\left|h_{s'}(x) - f_s(x)\right|$$
and its noisy analog
$$\widetilde{L}(s') = \frac{1}{2|X|}\sum_{x\in X}\left|\widetilde{h}_{s'}(x) - f_s(x)\right|.$$
Our optimization algorithm goes as follows: using the noisy classifier h̃_{s′}, evaluate the loss function L̃(log_g(x)) for each candidate variational parameter g^{s′} = x ∈ X, then set
$$g^{s'} = \operatorname*{argmin}_{x\in X}\widetilde{L}(\log_g(x)).$$
This algorithm is efficient in the size of the training set, since it only requires |X|² evaluations of h̃_{s′}.
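A schematic sketch of this training loop on the same kind of toy instance as above (ours; the prime, shot count, training-set size and the hidden s are arbitrary, and the R-shot overlap estimate is simulated by binomial sampling) is:

```python
import numpy as np

rng = np.random.default_rng(2)
p, g, k, R = 23, 5, 2, 2000                # toy prime, generator, interval exponent, shots (all illustrative)
dlog = {pow(g, e, p): e for e in range(p - 1)}

def interval_state(start_log, length):
    """Uniform superposition (as a real vector over Z_p^*) of g^start, ..., g^(start+length-1)."""
    v = np.zeros(p - 1)
    for i in range(length):
        v[pow(g, start_log + i, p) - 1] = 1.0
    return v / np.sqrt(length)

phi = lambda x: interval_state(dlog[x], 2 ** k)            # |phi(x)>, Eq. (17)
phi_s = lambda s: interval_state(s, (p - 1) // 2)          # |phi_s'>, Eq. (18)
Delta = 2 ** (k + 1) / (p - 1)

def noisy_classifier(x, s_prime):
    """h~_{s'}(x): threshold an R-shot estimate of |<phi(x)|phi_s'>|^2, as in Eq. (19)."""
    q = float(np.dot(phi(x), phi_s(s_prime)) ** 2)
    estimate = rng.binomial(R, q) / R
    return 1 if estimate / Delta >= 0.5 else -1

s_true = 7                                                  # hidden concept parameter
f_s = lambda x: 1 if (dlog[x] - s_true) % (p - 1) < (p - 1) // 2 else -1

X = [int(x) for x in rng.choice(np.arange(1, p), size=10, replace=False)]   # training set

def noisy_loss(s_prime):
    return sum(abs(noisy_classifier(x, s_prime) - f_s(x)) for x in X) / (2 * len(X))

# Grid search over the |X| candidate parameters g^{s'} = x, x in X.
best = min(X, key=lambda x: noisy_loss(dlog[x]))
print("recovered s' =", dlog[best], " true s =", s_true)
```

On such toy instances the argmin typically lands on a training point whose discrete log is close to s, which is what Lemma 8 below formalizes.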
To prove Theorem 3, we first show that we can enforce argmin_{x∈X} L̃(log_g(x)) = argmin_{x∈X} L(log_g(x)) with high probability (Lemma 7), and second, that this algorithm also leads to an s′ close to the optimal s in log-space with high probability (Lemma 8).
Lemma 7. For a training set of size n^c such that c ≥ log_n(8/δ), a t ≥ 3c in the definition of ∆, and a number of circuit evaluations per inner product R ≥ 4n^{2(t+c)}/δ, we have
$$\Pr\!\left[\operatorname*{argmin}_{x\in X}\widetilde{L}(\log_g(x)) = \operatorname*{argmin}_{x\in X}L(\log_g(x))\right] \ge 1 - \frac{\delta}{2}.$$
Proof. In order for the minima of the two losses to be obtained for the same x ∈ X, it is sufficient to ensure that the classifiers h_{log_g(x_i)} and h̃_{log_g(x_i)} agree on all points x_j, for all x_i, x_j ∈ X. This can be enforced by having:
$$\bigcap_{\substack{i,j\\ i\neq j}}\left\{x_i \in I_{\log_g(x_j)}\right\} \;\cap\; \bigcap_{i,s'}\left\{|e_{i,s'}| \le \frac{\Delta}{2}\right\},$$
that is, having for all classifiers h_{log_g(x_j)} that all points x_i ∈ X, x_i ≠ x_j, are away from its boundary regions in log-space, and that the labels assigned to these points are all the same under noise.
We bound the probability of the negation of this event:
$$\Pr\!\left[\,\bigcup_{\substack{i,j\\ i\neq j}}\left\{x_i \notin I_{\log_g(x_j)}\right\} \cup \bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right] \le \Pr\!\left[\,\bigcup_{\substack{i,j\\ i\neq j}}\left\{x_i \notin I_{\log_g(x_j)}\right\}\right] + \Pr\!\left[\,\bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right],$$
where the first probability is bounded, via the union bound, by
$$\sum_{\substack{i,j\\ i\neq j}}\frac{\Delta}{2} \le \frac{n^{2c}\Delta}{2}.$$
By setting t ≥ 3c, we have ∆ ≤ 4n^{−t} ≤ 4n^{−3c}, which allows us to bound this first probability by δ/4 when c ≥ log_n(8/δ).
As for the second probability above, we have
$$\Pr\!\left[\,\bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right] \le \sum_{i,s'}\Pr\!\left[|e_{i,s'}| \ge \frac{\Delta}{2}\right] \le \frac{4n^{2c}}{\Delta^2 R}$$
using the union bound and Eq. (20). By setting R ≥ 4n^{2(t+c)}/δ ≥ 16n^{2c}/(∆²δ) (since ∆ ≥ 2n^{−t}), we can bound this second probability by δ/4 as well, which gives:
$$\Pr\!\left[\operatorname*{argmin}_{x\in X}\widetilde{L}(\log_g(x)) = \operatorname*{argmin}_{x\in X}L(\log_g(x))\right] \ge 1 - \Pr\!\left[\,\bigcup_{\substack{i,j\\ i\neq j}}\left\{x_i \notin I_{\log_g(x_j)}\right\} \cup \bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right] \ge 1 - \delta/2.$$
Lemma 8. For a training set of size n^c such that c ≥ log_n(log(δ/2)/log(1 − 2ε)), the parameter s′ = log_g(argmin_{x∈X} L(log_g(x))) is within ε distance of the optimal s with probability:
$$\Pr\!\left[\frac{|s'-s|}{p-1} \le \varepsilon\right] \ge 1 - \frac{\delta}{2}.$$
From here, we notice that, when we apply Corollary 1 for ε′ ≤ ∆/2, our optimization algorithm returns an s′ such that, with probability 1 − δ, the set I_{s′} is equal to I_s and is of size (p − 1)(1 − 2∆). In the event where |s′ − s|/(p − 1) ≤ ε′ ≤ ∆/2, we can hence bound the accuracy of the noisy classifier:
$$\mathrm{Acc}_{\widetilde{h}_{s'}}(f_s) = \frac{1}{p-1}\sum_{x\in\mathbb{Z}_p^*}\Pr\!\left[\widetilde{h}_{s'}(x) = f_s(x)\right] \ge \frac{1}{p-1}\sum_{x\in I_s}\Pr\!\left[\widetilde{h}_{s'}(x) = f_s(x)\right]$$
$$\ge (1-2\Delta)\min_{x_i\in I_s}\Pr\!\left[|e_{i,s'}| \le \frac{\Delta}{2}\right] \ge (1-2\Delta)\left(1 - \frac{4}{\Delta^2 R}\right) = 1 - 2\Delta\left(1 - \frac{4}{\Delta^2 R}\right) - \frac{4}{\Delta^2 R}$$
with probability 1 − δ. We now set t ≥ max{3 log_n(8/δ), log_n(16/ε)}, ε′ = n^{−t} and R ≥ max{4n^{2(t+c)}/δ, 128/ε³}, such that 2ε′ = 2n^{−t} ≤ ∆ ≤ 4n^{−t} ≤ ε/4, 1 − 4/(∆²R) ≤ 1 and 4/(∆²R) ≤ ε/2.
Using these inequalities, we get
$$\mathrm{Acc}_{\widetilde{h}_{s'}}(f_s) \ge 1 - \varepsilon.$$