Parametrized Quantum Policies for Reinforcement Learning
Abstract
With the advent of real-world quantum computing, the idea that parametrized
quantum computations can be used as hypothesis families in a quantum-classical
machine learning system is gaining increasing traction. Such hybrid systems have
already shown the potential to tackle real-world tasks in supervised and generative
learning, and recent works have established their provable advantages in special
artificial tasks. Yet, in the case of reinforcement learning, which is arguably the most
challenging setting and in which learning boosts would be extremely valuable, no proposal
has been successful in solving even standard benchmarking tasks, nor in showing a
theoretical learning advantage over classical algorithms. In this work, we achieve
both. We propose a hybrid quantum-classical reinforcement learning model using
very few qubits, which we show can be effectively trained to solve several standard
benchmarking environments. Moreover, we demonstrate, and formally prove, the
ability of parametrized quantum circuits to solve certain learning tasks that are
intractable for classical models, including current state-of-the-art deep neural networks,
under the widely believed classical hardness of the discrete logarithm problem.
1 Introduction
Hybrid quantum machine learning models constitute one of the most promising applications of
near-term quantum computers [1, 2]. In these models, parametrized and data-dependent quantum
computations define a hypothesis family for a given learning task, and a classical optimization
algorithm is used to train them. For instance, parametrized quantum circuits (PQCs) [3] have already
proven successful in classification [4–8], generative modeling [9, 10] and clustering [11] problems.
Moreover, recent results have shown proofs of their learning advantages in artificially constructed
tasks [6, 12], some of which are based on widely believed complexity-theoretic assumptions [12–15].
All these results, however, only consider supervised and generative learning settings.
Arguably, the largest impact quantum computing can have is by providing enhancements to the
hardest learning problems. From this perspective, reinforcement learning (RL) stands out as a field
that can greatly benefit from a powerful hypothesis family. This is showcased by the boost in learning
performance that deep neural networks (DNNs) have provided to RL [16], which enabled systems
like AlphaGo [17], among other achievements [18, 19]. Nonetheless, the true potential of near-term
quantum approaches to RL remains largely unexplored. The few existing works [20–23] have so far
failed to solve classical benchmarking tasks using PQCs, and have left open the question of their
ability to provide a learning advantage.
[Figure 1: schematic of a PQC-based RL policy. A PQC parametrized by θ and acting on qubits initialized in |0⟩ encodes the state s; the measured expectation values ⟨O_a⟩_{s,θ} define the policy π_θ(a|s), from which an action a is sampled, and yield the log-policy gradient ∇_θ log π_θ.]
Contributions In this work, we demonstrate the potential of policies based on PQCs in solving
classical RL environments. To do this, we first propose new model constructions, describe their learn-
ing algorithms, and show numerically the influence of design choices on their learning performance.
In our numerical investigation, we consider benchmarking environments from OpenAI Gym [24], for
which good and simple DNN policies are known, and in which we demonstrate that PQC policies can
achieve comparable performance. Second, inspired by the classification task of Havlíček et al. [6],
conjectured to be classically hard by the authors, we construct analogous RL environments where
we show an empirical learning advantage of our PQC policies over standard DNN policies used in
deep RL. In the same direction, we construct RL environments with a provable gap in performance
between a family of PQC policies and any efficient classical learner. These environments essentially
build upon the work of Liu et al. [14] by embedding into a learning setting the discrete logarithm
problem (DLP), which is the problem solved by Shor’s celebrated quantum algorithm [25] but widely
believed to be classically hard to solve [26].
Related work Recently, a few works have been exploring hybrid quantum approaches for RL.
Among these, Refs. [20, 21] also trained PQC-based agents in classical RL environments. However,
these take a value-based approach to RL, meaning that they use PQCs as value-function approximators
instead of direct policies. The learning agents in these works are also tested on OpenAI Gym
environments (namely, a modified FrozenLake and CartPole), but do not achieve good enough
performance to solve them according to the Gym specifications. Ref. [27] shows that, using
some of our design choices for PQCs in RL (i.e., data re-uploading circuits [28] with trainable
observable weights and input scaling parameters), one can also solve these environments using a
value-based approach. An actor-critic approach to QRL was introduced in Ref. [22], using both a PQC
actor (or policy) and a PQC critic (or value-function approximator). In contrast to our work, these
are trained in quantum environments (e.g., quantum-control environments), that provide a quantum
state to the agent, which acts back with a continuous classical action. These aspects make it a very
different learning setting to ours. Ref. [23] also describes a hybrid quantum-classical algorithm for
value-based RL. The function-approximation models on which this algorithm is applied are however
not PQCs but energy-based neural networks (e.g., deep and quantum Boltzmann machines). Finally,
our work provides an alternative approach to take advantage of quantum effects in designing QRL
agents compared to earlier approaches [29–33], which are mainly based on (variations of) Grover’s
search algorithm [34] or quantum annealers [35] to speed up sampling routines.
Code An accompanying tutorial [36], implemented as part of the quantum machine learning library
TensorFlow Quantum [37], provides the code required to reproduce our numerical results and explore
different settings. It also implements the Q-learning approach for PQC-based RL of Skolik et al. [27].
[Figure 2 circuit: qubits |0⟩_0 and |0⟩_1 each undergo H, the variational rotations R_z(φ_{0,·}), R_y(φ_{0,·}) of U_var(φ_0), a Ctrl-Z entangler, the encoding rotations R_y(λ_{0,·} s_·), R_z(λ_{0,·} s_·) of U_enc(s, λ_0), and a final variational unitary U_var(φ_1).]
Figure 2: PQC architecture for n = 2 qubits and depth D_enc = 1. This architecture is composed
of alternating layers of encoding unitaries U_enc(s, λ_i), taking as input a state vector s = (s_0, . . . , s_{d−1})
and scaling parameters λ_i (part of a vector λ ∈ ℝ^{|λ|} of dimension |λ|), and variational unitaries
U_var(φ_i), taking as input rotation angles φ_i (part of a vector φ ∈ [0, 2π]^{|φ|} of dimension |φ|).
At the core of our parametrized quantum policies is a PQC defined by a unitary U(s, θ) that acts on
a fixed n-qubit state (e.g., |0^{⊗n}⟩). This unitary encodes an input state s ∈ ℝ^d and is parametrized
by a trainable vector θ. Although different choices of PQCs are possible, throughout our numerical
experiments (Secs. 3 and 4.2) we consider so-called hardware-efficient PQCs [40] with an alternating-
layered architecture [28, 41]. This architecture is depicted in Fig. 2 and essentially consists of an
alternation of D_enc encoding unitaries U_enc (composed of single-qubit rotations R_z, R_y) and D_enc + 1
variational unitaries U_var (composed of single-qubit rotations R_z, R_y and entangling Ctrl-Z gates).
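To make the architecture concrete, the following is a minimal Cirq sketch of such an alternating-layered circuit. It is illustrative only: the function name, parameter shapes and the simple chain of Ctrl-Z entanglers are our own choices (the entangling topologies actually used per environment are listed in Table 3), not the authors' code.

```python
import cirq
import numpy as np

def alternating_layer_pqc(qubits, phis, lambdas, state):
    """Sketch of the alternating-layered architecture of Fig. 2: D_enc + 1 variational
    layers (Rz, Ry rotations followed by a chain of Ctrl-Z entanglers) interleaved with
    D_enc encoding layers Ry(lambda * s_j), Rz(lambda * s_j).
    Assumed shapes: phis is (D_enc + 1, n, 2), lambdas is (D_enc, n, 2), state has n entries."""
    n = len(qubits)
    circuit = cirq.Circuit(cirq.H.on_each(*qubits))
    d_enc = len(lambdas)
    for layer in range(d_enc + 1):
        # variational unitary U_var(phi_layer)
        for j, q in enumerate(qubits):
            circuit.append([cirq.rz(phis[layer][j][0]).on(q),
                            cirq.ry(phis[layer][j][1]).on(q)])
        circuit.append(cirq.CZ(qubits[j], qubits[j + 1]) for j in range(n - 1))
        # encoding unitary U_enc(s, lambda_layer), absent after the last variational layer
        if layer < d_enc:
            for j, q in enumerate(qubits):
                circuit.append([cirq.ry(lambdas[layer][j][0] * state[j]).on(q),
                                cirq.rz(lambdas[layer][j][1] * state[j]).on(q)])
    return circuit

# e.g., the n = 2, D_enc = 1 layout of Fig. 2, with all-zero angles and unit scalings:
qubits = cirq.LineQubit.range(2)
circuit = alternating_layer_pqc(qubits, np.zeros((2, 2, 2)), np.ones((1, 2, 2)), state=[0.1, -0.4])
```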
For any given PQC, we define two families of policies, differing in how the final quantum states
|ψ_{s,θ}⟩ = U(s, θ) |0^{⊗n}⟩ are used. In the RAW-PQC model, we exploit the probabilistic nature of
quantum measurements to define an RL policy. For |A| actions available to the RL agent, we partition
the Hilbert space H into |A| disjoint subspaces (e.g., spanned by computational basis states) and
associate a projection P_a to each of these subspaces. The projective measurement associated to the
observable O = Σ_a a P_a then defines our RAW-PQC policy π_θ(a|s) = ⟨P_a⟩_{s,θ}. A limitation of this policy family, however,
is that it does not have a directly adjustable greediness (i.e., a control parameter that makes the
policy more peaked). This consideration arises naturally in an RL context where an agent’s policy
needs to shift from an exploratory behavior (i.e., close to a uniform distribution) to a more exploitative
behavior (i.e., a peaked distribution). To remedy this limitation, we define the SOFTMAX-PQC model,
which applies an adjustable softmax_β non-linear activation function to the expectation values ⟨P_a⟩_{s,θ}
measured on |ψ_{s,θ}⟩. Since the softmax function normalizes any real-valued input, we can generalize
the projections P_a to arbitrary Hermitian operators O_a. We also generalize these observables one
step further by assigning them trainable weights. The two models are formally defined below.
Definition 1 (RAW- and SOFTMAX-PQC). Given a PQC acting on n qubits, taking as input a state
s ∈ ℝ^d, rotation angles φ ∈ [0, 2π]^{|φ|} and scaling parameters λ ∈ ℝ^{|λ|}, such that its corresponding
unitary U(s, φ, λ) produces the quantum state |ψ_{s,φ,λ}⟩ = U(s, φ, λ) |0^{⊗n}⟩, we define its associated
RAW-PQC policy as:
    π_θ(a|s) = ⟨P_a⟩_{s,θ},                                                       (2)
where ⟨P_a⟩_{s,θ} = ⟨ψ_{s,φ,λ}| P_a |ψ_{s,φ,λ}⟩ is the expectation value of a projection P_a associated to action
a, such that Σ_a P_a = I and P_a P_{a'} = δ_{a,a'} P_a. Here θ = (φ, λ) constitutes all of its trainable parameters.
Using the same PQC, we also define a SOFTMAX-PQC policy as:
    π_θ(a|s) = e^{β⟨O_a⟩_{s,θ}} / Σ_{a'} e^{β⟨O_{a'}⟩_{s,θ}},                     (3)
where ⟨O_a⟩_{s,θ} = ⟨ψ_{s,φ,λ}| Σ_i w_{a,i} H_{a,i} |ψ_{s,φ,λ}⟩ is the expectation value of the weighted Hermitian
operators H_{a,i} associated to action a, β ∈ ℝ is an inverse-temperature parameter, and θ = (φ, λ, w).
Note that we adopt here a very general definition for the observables O_a of our SOFTMAX-PQC
policies. As we discuss in more detail in Appendix C, very expressive trainable observables can in
some extreme cases take over all training of the PQC parameters φ, λ and render the role of the PQC
in learning trivial. However, in practice, as well as in our numerical experiments, we only consider
very restricted observables O_a = Σ_i w_{a,i} H_{a,i}, where the H_{a,i} are (tensor products of) Pauli matrices or
high-rank projections on computational basis states, which do not allow for these extreme scenarios.
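As a concrete illustration of Def. 1, the following Python sketch turns estimated expectation values into the two policies of Eqs. (2) and (3); the function names and the renormalization of noisy projector estimates are our own conventions, not the paper's implementation.

```python
import numpy as np

def softmax_pqc_policy(expvals, beta):
    """SOFTMAX-PQC policy of Eq. (3): expvals[a] holds the estimated <O_a>_{s,theta}
    and beta is the inverse temperature; returns pi_theta(.|s)."""
    logits = beta * np.asarray(expvals, dtype=float)
    logits -= logits.max()            # shift for numerical stability only
    probs = np.exp(logits)
    return probs / probs.sum()

def raw_pqc_policy(projector_expvals):
    """RAW-PQC policy of Eq. (2): projector_expvals[a] holds the estimated <P_a>_{s,theta}
    for a complete set of orthogonal projections, so the values already form a probability
    distribution; we only clip and renormalize to absorb estimation noise."""
    p = np.clip(np.asarray(projector_expvals, dtype=float), 0.0, None)
    return p / p.sum()

# e.g., two actions whose observables have estimated expectation values 0.2 and -0.5:
print(softmax_pqc_policy([0.2, -0.5], beta=1.0))
```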
In our PQC construction, we include trainable scaling parameters λ, used in every encoding gate to
re-scale its input components. This modification to the standard data encoding in PQCs comes in
light of recent considerations on the structure of PQC functions [42]. These additional parameters
make it possible to represent functions with a wider and richer spectrum of frequencies, and hence provide
shallow PQCs with more expressive power.
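This effect can already be seen on a single qubit. The following numpy sketch (our own toy example, not taken from the paper) evaluates ⟨Z⟩ for a re-uploading circuit with two encoding rotations and shows that larger scaling parameters λ give the function access to higher frequencies in the input x.

```python
import numpy as np

def ry(a):  # single-qubit Y-rotation
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]], dtype=complex)

def rz(a):  # single-qubit Z-rotation
    return np.diag([np.exp(-1j * a / 2), np.exp(1j * a / 2)])

def expval_z(x, lambdas, phis):
    """<Z> of a single-qubit re-uploading circuit alternating encodings Ry(lambda_i * x)
    with variational rotations Rz(phi_i), starting from |0>."""
    psi = np.array([1.0, 0.0], dtype=complex)
    for lam, phi in zip(lambdas, phis):
        psi = rz(phi) @ ry(lam * x) @ psi
    return float(np.real(psi.conj() @ np.diag([1.0, -1.0]) @ psi))

xs = np.linspace(0, 2 * np.pi, 200)
f_plain  = [expval_z(x, lambdas=[1.0, 1.0], phis=[0.3, 1.2]) for x in xs]  # frequencies up to 2
f_scaled = [expval_z(x, lambdas=[3.0, 2.0], phis=[0.3, 1.2]) for x in xs]  # frequencies up to 5
```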
In order to analyze the properties of our PQC policies without the interference of other learning
mechanisms [43], we train these policies using the basic Monte Carlo policy gradient algorithm
REINFORCE [44, 45] (see Alg. 1). This algorithm consists of evaluating Monte Carlo estimates of
the value function V_{π_θ}(s_0) = E_{π_θ}[Σ_{t=0}^{H−1} γ^t r_t], with γ ∈ [0, 1], using batches of interactions with the
environment, and updating the policy parameters θ via gradient ascent on V_{π_θ}(s_0). The resulting
updates (see line 8 of Alg. 1) involve the gradient of the log-policy ∇_θ log π_θ(a|s), which we
therefore need to compute for our policies. We describe this computation in the following lemma.
Lemma 1. Given a SOFTMAX-PQC policy π_θ, the gradient of its logarithm is given by:
    ∇_θ log π_θ(a|s) = β ( ∇_θ ⟨O_a⟩_{s,θ} − Σ_{a'} π_θ(a'|s) ∇_θ ⟨O_{a'}⟩_{s,θ} ).     (4)
Partial derivatives with respect to observable weights are trivially given by ∂_{w_{a,i}} ⟨O_a⟩_{s,θ} =
⟨ψ_{s,φ,λ}| H_{a,i} |ψ_{s,φ,λ}⟩ (see Def. 1), while derivatives with respect to rotation angles ∂_{φ_i} ⟨O_a⟩_{s,θ}
and scaling parameters¹ ∂_{λ_i} ⟨O_a⟩_{s,θ} can be estimated via the parameter-shift rule [46, 42]:
    ∂_i ⟨O_a⟩_{s,θ} = (1/2) ( ⟨O_a⟩_{s, θ + (π/2) e_i} − ⟨O_a⟩_{s, θ − (π/2) e_i} ),        (5)
i.e., using the difference of two expectation values ⟨O_a⟩_{s,θ'} with a single angle shifted by ±π/2.
For a RAW-PQC policy π_θ, we have instead:
    ∇_θ log π_θ(a|s) = ∇_θ ⟨P_a⟩_{s,θ} / ⟨P_a⟩_{s,θ},                                  (6)
where the partial derivatives ∂_{φ_i} ⟨P_a⟩_{s,θ} and ∂_{λ_i} ⟨P_a⟩_{s,θ} can be estimated similarly to above.
¹ Note that the parameters λ do not act as rotation angles. To compute the derivatives ∂_{λ_{i,j}} ⟨O_a⟩_{s,θ}, one
should instead compute derivatives w.r.t. the effective angles s_j λ_{i,j} and apply the chain rule: ∂_{λ_{i,j}} ⟨O_a⟩_{s,θ} = s_j ∂_{s_j λ_{i,j}} ⟨O_a⟩_{s,θ}.
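For concreteness, here is a minimal sketch of these gradient estimates; `expval_fn` and `expval_of_angle` are placeholders standing in for circuit executions and measurement averaging, not an actual hardware interface.

```python
import numpy as np

def parameter_shift_grad(expval_fn, theta):
    """Gradient of <O_a>_{s,theta} w.r.t. rotation angles via Eq. (5):
    (1/2)(<O_a>_{theta + (pi/2) e_i} - <O_a>_{theta - (pi/2) e_i})."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2
        grad[i] = 0.5 * (expval_fn(theta + shift) - expval_fn(theta - shift))
    return grad

def scaling_param_derivative(expval_of_angle, lam, s_j):
    """Derivative w.r.t. a scaling parameter lambda_{i,j} (footnote 1): shift the effective
    encoding angle s_j * lambda_{i,j} by +-pi/2 and apply the chain rule (multiply by s_j)."""
    angle = s_j * lam
    d_angle = 0.5 * (expval_of_angle(angle + np.pi / 2) - expval_of_angle(angle - np.pi / 2))
    return s_j * d_angle
```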
Algorithm 1: REINFORCE with PQC policies and value-function baselines
Input: a PQC policy π_θ from Def. 1; a value-function approximator Ṽ_ω
1  Initialize parameters θ and ω;
2  while True do
3      Generate N episodes {(s_0, a_0, r_1, . . . , s_{H−1}, a_{H−1}, r_H)}_i following π_θ;
4      for episode i in batch do
5          Compute the returns G_{i,t} ← Σ_{t'=1}^{H−t} γ^{t'} r^{(i)}_{t+t'};
6          Compute the gradients ∇_θ log π_θ(a_t^{(i)}|s_t^{(i)}) using Lemma 1;
7      Fit {Ṽ_ω(s_t^{(i)})}_{i,t} to the returns {G_{i,t}}_{i,t};
8      Compute Δθ = (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t^{(i)}|s_t^{(i)}) (G_{i,t} − Ṽ_ω(s_t^{(i)}));
9      Update θ ← θ + αΔθ;
In some of our environments, we additionally rely on a linear value-function baseline to reduce the
variance of the Monte Carlo estimates [47]. We choose it to be identical to that of Ref. [48].
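The following Python sketch outlines one update of Alg. 1. The `policy` and `baseline` interfaces (a parameter vector `theta`, `log_prob_grad`, and fit/predict) are illustrative placeholders rather than the authors' implementation, which is available in the tutorial [36].

```python
import numpy as np

def reinforce_update(policy, baseline, episodes, gamma, alpha):
    """One iteration of Alg. 1. `episodes` is a list of trajectories
    [(s_0, a_0, r_1), ..., (s_{H-1}, a_{H-1}, r_H)]; `policy` exposes `theta` and
    `log_prob_grad(s, a)` (Lemma 1); `baseline` is any regressor with fit/predict,
    e.g. the linear value-function baseline of Ref. [48]."""
    all_states, all_returns, per_episode = [], [], []
    for ep in episodes:
        # returns G_t = sum_{t'=1}^{H-t} gamma^{t'} r_{t+t'}, computed backwards (line 5)
        G, returns = 0.0, []
        for (_, _, r) in reversed(ep):
            G = gamma * (r + G)
            returns.append(G)
        returns.reverse()
        per_episode.append(returns)
        all_states += [s for (s, _, _) in ep]
        all_returns += returns

    baseline.fit(all_states, all_returns)                # line 7
    delta = np.zeros_like(policy.theta)
    for ep, returns in zip(episodes, per_episode):       # line 8
        for (s, a, _), G in zip(ep, returns):
            delta += policy.log_prob_grad(s, a) * (G - baseline.predict([s])[0])
    policy.theta = policy.theta + alpha * delta / len(episodes)   # line 9
```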
A natural consideration when it comes to the implementation of our PQC policies is whether one can
efficiently (in the number of executions of the PQC on a quantum computer) sample and train them.
By design, sampling from our RAW-PQC policies can be done with a single execution (and measurement)
of the PQC: the projective measurement corresponding to the observable O = Σ_a a P_a
naturally samples a basis state associated to action a with probability ⟨P_a⟩_{s,θ}. However, as Eq. (6)
indicates, in order to train these policies using REINFORCE, one is nonetheless required to estimate
the expectation values ⟨P_a⟩_{s,θ}, along with the gradients ∇_θ ⟨P_a⟩_{s,θ}. Fortunately, these quantities
can be estimated efficiently up to some additive error ε, using only O(ε^{−2}) repeated executions and
measurements on a quantum computer.
In the case of our SOFTMAX-PQC policies, it is less clear whether similar noisy estimates ⟨Õ_a⟩_{s,θ}
of the expectation values ⟨O_a⟩_{s,θ} are sufficient to evaluate policies of the form of Eq. (3). We show
however that, using these noisy estimates, we can compute a policy π̃_θ that produces samples close to
those of the true policy π_θ. We state our result formally in the following lemma, proven in Appendix B.
Lemma 2. For a SOFTMAX-PQC policy π_θ defined by a unitary U(s, θ) and observables O_a,
call ⟨Õ_a⟩_{s,θ} approximations of the true expectation values ⟨O_a⟩_{s,θ} with at most ε additive error.
Then the approximate policy π̃_θ = softmax_β(⟨Õ_a⟩_{s,θ}) has total variation distance O(βε) to π_θ =
softmax_β(⟨O_a⟩_{s,θ}). Since expectation values can be efficiently estimated to additive error on a
quantum computer, this implies efficient approximate sampling from π_θ.
We also obtain a similar result for the log-policy gradient of SOFTMAX-PQCs (see Lemma 1), which
we show can be efficiently estimated to additive error in ℓ∞-norm (see Appendix B for a proof).
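A quick numerical sanity check of Lemma 2 (our own illustration): perturbing each expectation value by at most ε moves the softmax policy by at most 2 sinh(2βε) = 4βε + O((βε)³) in total variation distance, where TV is taken, as in Appendix B, as the sum of absolute differences.

```python
import numpy as np

def softmax(values, beta):
    z = beta * np.asarray(values, dtype=float)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
beta, eps = 1.0, 1e-3
expvals = rng.uniform(-1, 1, size=4)                 # true <O_a>_{s,theta}
noisy = expvals + rng.uniform(-eps, eps, size=4)     # estimates with additive error <= eps
tv = np.abs(softmax(noisy, beta) - softmax(expvals, beta)).sum()
assert tv <= 2 * np.sinh(2 * beta * eps)             # = 4*beta*eps + O((beta*eps)^3), cf. Eq. (7)
```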
[Figure 3 plots: average collected rewards vs. episode for CartPole-v1 (n = 4, D_enc = 1), MountainCar-v0 (n = 2, D_enc = 4) and Acrobot-v1 (n = 6, D_enc = 2), comparing softmax-PQC and raw-PQC agents.]
Figure 3: Numerical evidence of the advantage of SOFTMAX -PQC over RAW-PQC in bench-
marking environments. The learning curves (20 agents per curve) of randomly-initialized
SOFTMAX -PQC agents (green curves) and RAW-PQC agents (red curves) in OpenAI Gym en-
vironments: CartPole-v1, MountainCar-v0, and Acrobot-v1. Each curve is temporally averaged with
a time window of 10 episodes. All agents have been trained using the REINFORCE algorithm (see
Alg. 1), with value-function baselines for the MountainCar and Acrobot environments.
[Figure 4 plots: average collected rewards vs. episode for SOFTMAX-PQC agents on CartPole, MountainCar and Acrobot; curves compare the reference depths (1, 4, 2), increased depths (5, 6, 5), fixed scaling parameters λ, and fixed observable weights with β = 2, 10.]
Figure 4: Influence of the model architecture for SOFTMAX -PQC agents. The blue curves in
each plot correspond to the learning curves from Fig. 3 and are taken as a reference. Other curves
highlight the influence of individual hyperparameters. For RAW-PQC agents, see Appendix E.
The results of the previous subsection however do not indicate whether other design choices we have
made in Sec. 2.2 had an influence on the performance of our quantum agents. To address this, we
run a second set of experiments, presented in Fig. 4. In these simulations, we evaluate the average
performance of our SOFTMAX -PQC agents after modifying one of three design choices: we either
increment the depth of the PQC (until no significant increase in performance is observed), fix the
input-scaling parameters λ to 1, or fix the observable weights w to 1. By comparing the performance
of these agents with that of the agents from Fig. 3, we can make the following observations:
• Influence of depth: Increasing the depth of the PQC generally (though not strictly) improves the
performance of the agents. Note that the maximum depth we tested was D_enc = 10.
• Influence of scaling parameters λ: We observe that training these scaling parameters in general
benefits the learning performance of our PQC policies, likely due to their increased expressivity.
• Influence of trainable observable weights w: Our final consideration relates to the importance of
having a policy with “trainable greediness” in RL scenarios. For this, we consider SOFTMAX-PQC
agents with observables βO_a kept fixed throughout training. We observe that this has the general effect
of decreasing the performance and/or the speed of convergence of the agents. We also see that
policies with a fixed high β (or, equivalently, a large observable norm β‖O_a‖) tend to have poor
learning performance, likely due to their lack of exploration in the RL environments.
Finally, note that none of the numerical simulations performed here included any source of noise in
the PQC evaluations. It would be an interesting research direction to assess the influence of (simulated
or hardware-induced) noise on the learning performance of PQC agents.
SL-DLP Our objective is to show that analogous separations between classical and quantum
learners can be established for RL environments, in terms of their attainable value functions. We start
by pointing out that supervised learning (SL) tasks (and so the classification problem of Liu et al.) can
be trivially embedded into RL environments [50]: for a given concept fs , the states x are datapoints,
an action a is an agent’s guess on the label of x, an immediate reward specifies if it was correct
(i.e., fs (x) = a), and subsequent states are chosen uniformly at random. In such settings, the value
function is trivially related to the testing accuracy of the SL problem, yielding a direct reduction of
the separation result of Liu et al. [14] to an RL setting. We call this family of environments SL-DLP.
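A minimal sketch of this SL-to-RL embedding (the class and method names are our own; any labeling function and state set, e.g. ℤ*_p and f_s for SL-DLP, can be plugged in):

```python
import numpy as np

class SLEnv:
    """RL wrapper around a binary classification task: states are datapoints, the action
    is a label guess in {-1, +1}, the reward is +1/-1 for a correct/incorrect guess, and
    next states are drawn i.i.d. from the data distribution."""

    def __init__(self, states, labeling_fn, seed=None):
        self.states = list(states)          # input space of the classification task
        self.labeling_fn = labeling_fn      # ground-truth concept, e.g. f_s
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = self.rng.choice(self.states)
        return self.s

    def step(self, action):
        reward = 1.0 if action == self.labeling_fn(self.s) else -1.0
        self.s = self.rng.choice(self.states)   # next state is again sampled uniformly
        done = True                             # length-1 episodes, as in the reduction above
        return self.s, reward, done
```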
Cliffwalk-DLP In the SL-DLP construction, we made the environment fully random in order to
simulate the process of obtaining i.i.d. samples in an SL setting. It is an interesting question whether
similar results can be obtained for environments that are less random, and endowed with temporal
structure, which is characteristic of RL. In our second family of environments (Cliffwalk-DLP),
we supplement the SL-DLP construction with next-state transitions inspired by the textbook “cliff
walking” environment of Sutton & Barto [44]: all states are ordered in a chain and some actions of the
agent can lead to immediate episode termination. We nonetheless keep stochasticity in the environment
by allowing next states to be uniformly sampled with a certain probability δ (a common device in RL to
ensure that an agent is not simply memorizing a correct sequence of actions). This allows us to show
that, as long as sufficient randomness is provided, we still have a simple classical-quantum separation.
Deterministic-DLP In the two families constructed above, each environment instance provided
the randomness needed for a reduction from the SL problem. This brings us to the question of
whether separations are also possible for fully deterministic environments. In this case, it is clear
that for any given environment, there exists an efficient classical agent which performs perfectly
over any polynomial horizon (a lookup-table will do). However, we show in our third family of
environments (Deterministic-DLP) that a separation can still be attained by moving the randomness
to the choice of the environment itself: the assumption that an efficient classical agent is successful in most
of exponentially many randomly generated (but otherwise deterministic) environments implies the
existence of an efficient classical algorithm for DLP.
We summarize our results in the following theorem, detailed and proven in Appendices G through I.
Theorem 1. There exist families of reinforcement learning environments which are: i) fully random
(i.e., subsequent states are independent from the previous state and action); ii) partially random
(i.e., the previous moves determine subsequent states, except with a probability δ at least 0.86 where
they are chosen uniformly at random), and iii) fully deterministic; such that there exists a separation
in the value functions achievable by a given quantum polynomial-time agent and any classical
polynomial-time agent. Specifically, the value of the initial state for the quantum agent V_q(s_0) is
ε-close to the optimal value function (for a chosen ε, and with probability above 2/3). Further, if
there exists an efficient classical learning agent that achieves a value V_c(s_0) better than V_rand(s_0) + ε'
(for a chosen ε', and with probability above 0.845), then there exists an efficient classical algorithm
to solve DLP. Finally, we have V_q(s_0) − V_c(s_0) larger than some constant, which depends on the
details of the environment.
The remaining point we need to address here is that the learning agents of Liu et al. do not rely on
PQCs but rather support vector machines (SVMs) based on quantum kernels [6, 7]. Nonetheless,
using a connection between these quantum SVMs and PQCs [7], we construct PQC policies which
are as powerful in solving the DLP environments as the agents of Liu et al. (even under similar noise
considerations). We state our result in the following informal theorem, that we re-state formally,
along with the details of our construction in Appendices J and K.
Theorem 2 (informal version). Using a training set of size polynomial in n = log(p) and a number
of (noisy) quantum circuit evaluations also polynomial in n, we can train a PQC classifier on the
DLP task of Liu et al. of size n that achieves a testing accuracy arbitrarily close to optimal, with high
probability. This PQC classifier can in turn be used to construct close-to-optimal quantum agents in
our DLP environments, as prescribed by Theorem 1.
While the DLP environments establish a proof of the learning advantage PQC policies can have in
theory, these environments remain extremely contrived and artificial. They are based on algebraic
properties that agents must explicitly decrypt in order to perform well. Instead, we would like to
consider environments that are less tailored to a specific decryption function, which would allow
more general agents to learn. To do this, we take inspiration from the work of Havlíček et al. [6], who,
in order to test their PQC classifiers, define a learning task generated by similar quantum circuits.
[Figure 5 panels: (a) PQC labeling function, (b) SL-PQC, (c) Cliffwalk-PQC.]
Figure 5: Numerical evidence of the advantage of PQC policies over DNN policies in PQC-
generated environments. (a) Labeling function and training data used for both RL environments.
The data labels (red for +1 label and blue for −1 label) are generated using a RAW-PQC of depth
Denc = 4 with a margin ∆ = 0.3 (white areas). The training samples are uniformly sampled from
the blue and red regions, and arrows indicate the rewarded path of the cliffwalk environment. (b) and
(c) The learning curves (20 agents per curve) of randomly-initialized SOFTMAX -PQC agents and
DNN agents in RL environments where input states are (b) uniformly sampled from the dataset and
(c) follow cliffwalk dynamics. Each curve is temporally averaged with a time window of 10 episodes.
This dataset allows us to define two RL environments, similar to the SL-DLP and Cliffwalk-DLP
environments of Sec. 4.1:
5 Conclusion
In this work, we have investigated the design of quantum RL agents based on PQCs. We proposed
several constructions and showed the impact of certain design choices on learning performance. In
particular, we introduced the SOFTMAX -PQC model, where a softmax policy is computed from
expectation values of a PQC with both trainable observable weights and input scaling parameters.
These added features to standard PQCs used in ML (e.g., as quantum classifiers) enhance both the
expressivity and flexibility of PQC policies, which allows them to achieve a learning performance on
benchmarking environments comparable to that of standard DNNs. We additionally demonstrated the
existence of task environments, constructed out of PQCs, that are very natural for PQC agents but on
which DNN agents perform poorly. To strengthen this result, we constructed several RL
environments, each with a different degree of degeneracy (i.e., closeness to a supervised learning
task), where we showed a rigorous separation between a class of PQC agents and any classical
learner, based on the widely-believed classical hardness of the discrete logarithm problem. We
believe that our results constitute strides toward a practical quantum advantage in RL using near-term
quantum devices.
6 Broad impact
We expect our work to have an overall positive societal impact. Notably, we believe that our approach
to QRL could be beneficial in the two following ways:
At the same time, our work may have certain negative consequences. Notably, QRL will inherit
many of the problems that are already present in classical RL and ML in general. For instance, it
is not clear whether the question of interpretability of learning models [58] will be negatively or
positively impacted by switching to quantum models. One could argue that the inability to fully
access the quantum Hilbert spaces in which quantum computers operate can turn learning models
even further into “black-boxes” than existing classical models. Also, similarly to the fact that current
state-of-the-art ML/RL requires supercomputers that are not accessible to everyone, private and select
access to quantum computers could emphasize existing inequalities in developing and using AI.
Acknowledgments
The authors would like to thank Srinivasan Arunachalam for clarifications on the testing accuracy of
their quantum classifier in the DLP classification task. The authors would also like to thank Andrea
Skolik and Arjan Cornelissen for helpful discussions and comments. CG thanks Thomas Moerland
for discussions in the early phases of this project. SJ and HJB acknowledge support from the Austrian
Science Fund (FWF) through the projects DK-ALM:W1259-N27 and SFB BeyondC F7102. SJ
also acknowledges the Austrian Academy of Sciences as a recipient of the DOC Fellowship. This
work was in part supported by the Dutch Research Council (NWO/OCW), as part of the Quantum
Software Consortium program (project number 024.003.037). VD and SM acknowledge the support
by the project NEASQC funded from the European Union’s Horizon 2020 research and innovation
programme (grant agreement No 951821). VD and SM also acknowledge partial funding by an
unrestricted gift from Google Quantum AI. The computational results presented here have been
achieved in part using the LEO HPC infrastructure of the University of Innsbruck.
References
[1] John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018.
[2] Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw, Tobias Haug, Sumner Alperin-Lea, Abhinav
Anand, Matthias Degroote, Hermanni Heimonen, Jakob S Kottmann, Tim Menke, et al. Noisy
intermediate-scale quantum (NISQ) algorithms. arXiv preprint arXiv:2101.08448, 2021.
[3] Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini. Parameterized quantum
circuits as machine learning models. Quantum Science and Technology, 4(4):043001, 2019.
[4] Edward Farhi and Hartmut Neven. Classification with quantum neural networks on near term
processors. arXiv preprint arXiv:1802.06002, 2018.
[5] Maria Schuld, Alex Bocharov, Krysta M Svore, and Nathan Wiebe. Circuit-centric quantum
classifiers. Physical Review A, 101(3):032308, 2020.
[6] Vojtěch Havlíček, Antonio D Córcoles, Kristan Temme, Aram W Harrow, Abhinav Kandala,
Jerry M Chow, and Jay M Gambetta. Supervised learning with quantum-enhanced feature
spaces. Nature, 567(7747):209–212, 2019.
[7] Maria Schuld and Nathan Killoran. Quantum machine learning in feature Hilbert spaces.
Physical review letters, 122(4):040504, 2019.
[8] Evan Peters, Joao Caldeira, Alan Ho, Stefan Leichenauer, Masoud Mohseni, Hartmut Neven,
Panagiotis Spentzouris, Doug Strain, and Gabriel N Perdue. Machine learning of high dimen-
sional data on a noisy quantum processor. arXiv preprint arXiv:2101.09581, 2021.
[9] Jin-Guo Liu and Lei Wang. Differentiable learning of quantum circuit born machines. Physical
Review A, 98(6):062324, 2018.
[10] Daiwei Zhu, Norbert M Linke, Marcello Benedetti, Kevin A Landsman, Nhung H Nguyen,
C Huerta Alderete, Alejandro Perdomo-Ortiz, Nathan Korda, A Garfoot, Charles Brecque, et al.
Training of quantum circuits on a hybrid quantum computer. Science advances, 5(10):eaaw9918,
2019.
[11] JS Otterbach, R Manenti, N Alidoust, A Bestwick, M Block, B Bloom, S Caldwell, N Didier,
E Schuyler Fried, S Hong, et al. Unsupervised machine learning on a hybrid quantum computer.
arXiv preprint arXiv:1712.05771, 2017.
[12] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hart-
mut Neven, and Jarrod R McClean. Power of data in quantum machine learning. Nature
communications, 12(1):1–9, 2021.
[13] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and Dacheng Tao. Expressive power of
parametrized quantum circuits. Physical Review Research, 2(3):033125, 2020.
[14] Yunchao Liu, Srinivasan Arunachalam, and Kristan Temme. A rigorous and robust quantum
speed-up in supervised machine learning. Nature Physics, 17(9):1013–1017, 2021.
[15] Ryan Sweke, Jean-Pierre Seifert, Dominik Hangleiter, and Jens Eisert. On the quantum versus
classical learnability of discrete distributions. Quantum, 5:417, 2021.
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G
Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur
Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of
go without human knowledge. Nature, 550(7676):354, 2017.
[18] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ebiak, Christy
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large
scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[19] Piotr Mirowski, Matt Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson,
Denis Teplyashin, Karen Simonyan, Andrew Zisserman, Raia Hadsell, et al. Learning to
navigate in cities without a map. Advances in Neural Information Processing Systems, 31:
2419–2430, 2018.
[20] Samuel Yen-Chi Chen, Chao-Han Huck Yang, Jun Qi, Pin-Yu Chen, Xiaoli Ma, and Hsi-
Sheng Goan. Variational quantum circuits for deep reinforcement learning. IEEE Access, 8:
141007–141024, 2020.
[21] Owen Lockwood and Mei Si. Reinforcement learning with quantum variational circuit. In Pro-
ceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment,
volume 16, pages 245–251, 2020.
[22] Shaojun Wu, Shan Jin, Dingding Wen, and Xiaoting Wang. Quantum reinforcement learning in
continuous action space. arXiv preprint arXiv:2012.10711, 2020.
[23] Sofiene Jerbi, Lea M. Trenkwalder, Hendrik Poulsen Nautrup, Hans J. Briegel, and Vedran
Dunjko. Quantum enhancements for deep reinforcement learning in large spaces. PRX Quantum,
2:010328, Feb 2021.
[24] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[25] Peter W Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a
quantum computer. SIAM review, 41(2):303–332, 1999.
[26] Manuel Blum and Silvio Micali. How to generate cryptographically strong sequences of
pseudorandom bits. SIAM journal on Computing, 13(4):850–864, 1984.
[27] Andrea Skolik, Sofiene Jerbi, and Vedran Dunjko. Quantum agents in the gym: a variational
quantum algorithm for deep q-learning. arXiv preprint arXiv:2103.15084, 2021.
[28] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster, and José I Latorre. Data re-
uploading for a universal quantum classifier. Quantum, 4:226, 2020.
[29] Daoyi Dong, Chunlin Chen, Hanxiong Li, and Tzyh-Jong Tarn. Quantum reinforcement learning.
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(5):1207–1220,
2008.
[30] Giuseppe Davide Paparo, Vedran Dunjko, Adi Makmal, Miguel Angel Martin-Delgado, and
Hans J Briegel. Quantum speedup for active learning agents. Physical Review X, 4(3):031002,
2014.
[31] Vedran Dunjko, Jacob M Taylor, and Hans J Briegel. Quantum-enhanced machine learning.
Physical review letters, 117(13):130501, 2016.
[32] Daniel Crawford, Anna Levit, Navid Ghadermarzy, Jaspreet S Oberoi, and Pooya Ronagh. Rein-
forcement learning using quantum boltzmann machines. Quantum Information & Computation,
18(1-2):51–74, 2018.
[33] Florian Neukart, David Von Dollen, Christian Seidel, and Gabriele Compostella. Quantum-
enhanced reinforcement learning for finite-episode games with discrete state spaces. Frontiers
in Physics, 5:71, 2018.
[34] Lov K Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the
twenty-eighth annual ACM symposium on Theory of computing, pages 212–219, 1996.
[35] Mark W Johnson, Mohammad HS Amin, Suzanne Gildert, Trevor Lanting, Firas Hamze, Neil
Dickson, Richard Harris, Andrew J Berkley, Jan Johansson, Paul Bunyk, et al. Quantum
annealing with manufactured spins. Nature, 473(7346):194–198, 2011.
[36] TensorFlow Quantum. Parametrized quantum circuits for reinforcement learning.
URL : tensorflow.org/quantum/tutorials/quantum_reinforcement_learning, 2021.
[37] Michael Broughton, Guillaume Verdon, Trevor McCourt, Antonio J Martinez, Jae Hyeon Yoo,
Sergei V Isakov, Philip Massey, Murphy Yuezhen Niu, Ramin Halavati, Evan Peters, et al.
Tensorflow quantum: A software framework for quantum machine learning. arXiv preprint
arXiv:2003.02989, 2020.
[38] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information.
Cambridge University Press, 2000.
[39] Ronald De Wolf. Quantum computing: Lecture notes. arXiv preprint arXiv:1907.09415, 2019.
[40] Abhinav Kandala, Antonio Mezzacapo, Kristan Temme, Maika Takita, Markus Brink, Jerry M
Chow, and Jay M Gambetta. Hardware-efficient variational quantum eigensolver for small
molecules and quantum magnets. Nature, 549(7671):242–246, 2017.
[41] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. Effect of data encoding on the
expressive power of variational quantum-machine-learning models. Physical Review A, 103(3):
032430, 2021.
[42] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. Evaluating
analytic gradients on quantum hardware. Physical Review A, 99(3):032331, 2019.
[43] Lilian Weng. Policy gradient algorithms. URL : lilianweng.github.io/lil-log, 2018.
[44] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. 1998.
[45] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine learning, 8(3-4):229–256, 1992.
[46] Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa, and Keisuke Fujii. Quantum circuit
learning. Physical Review A, 98(3):032309, 2018.
[47] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for
gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):
1471–1530, 2004.
[48] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep
reinforcement learning for continuous control. In International conference on machine learning,
pages 1329–1338. PMLR, 2016.
[49] OpenAI. Leaderboard of openai gym environments. URL : github.com/openai/gym/wiki, 2020.
[50] Vedran Dunjko, Yi-Kai Liu, Xingyao Wu, and Jacob M Taylor. Exponential improvements for
quantum-accessible reinforcement learning. arXiv preprint arXiv:1710.11160, 2017.
[51] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak
Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using
a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
[52] Benjamin Villalonga, Dmitry Lyakh, Sergio Boixo, Hartmut Neven, Travis S Humble, Rupak
Biswas, Eleanor G Rieffel, Alan Ho, and Salvatore Mandrà. Establishing the quantum supremacy
frontier with a 281 pflop/s simulation. Quantum Science and Technology, 5(3):034003, 2020.
[53] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.
The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[54] Mufti Mahmud, Mohammed Shamim Kaiser, Amir Hussain, and Stefano Vassanelli. Appli-
cations of deep learning and reinforcement learning to biological data. IEEE transactions on
neural networks and learning systems, 29(6):2063–2079, 2018.
[55] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey.
arXiv preprint arXiv:1908.08796, 2019.
[56] Francisco Albarrán-Arriagada, Juan C Retamal, Enrique Solano, and Lucas Lamata.
Measurement-based adaptation protocol with quantum reinforcement learning. Physical Review
A, 98(4):042315, 2018.
[57] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J
Love, Alán Aspuru-Guzik, and Jeremy L O’brien. A variational eigenvalue solver on a photonic
quantum processor. Nature communications, 5(1):1–7, 2014.
[58] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review
of machine learning interpretability methods. Entropy, 23(1):18, 2021.
[59] Takahiro Goto, Quoc Hoan Tran, and Kohei Nakajima. Universal approximation property
of quantum machine learning models in quantum-enhanced feature spaces. Physical Review
Letters, 127(9):090506, 2021.
[60] Adrián Pérez-Salinas, David López-Núñez, Artur García-Sáez, Pol Forn-Díaz, and José I
Latorre. One qubit as a universal approximant. Physical Review A, 104(1):012405, 2021.
[61] Google. Cirq: A python framework for creating, editing, and invoking noisy intermediate scale
quantum circuits. URL : github.com/quantumlib/Cirq, 2018.
[62] Yasunari Suzuki, Yoshiaki Kawase, Yuya Masumura, Yuria Hiraga, Masahiro Nakadai, Jiabao
Chen, Ken M Nakanishi, Kosuke Mitarai, Ryosuke Imai, Shiro Tamiya, et al. Qulacs: a fast
and versatile quantum circuit simulator for research purpose. arXiv preprint arXiv:2011.13524,
2020.
Supplementary Material for:
Parametrized Quantum Policies for Reinforcement Learning
Sofiene Jerbi, Casper Gyurik, Simon C. Marshall, Hans J. Briegel, Vedran Dunjko
Due to the monotonicity of the exponential, we have, for all a:
    e^{−βε} e^{β⟨O_a⟩_{s,θ}} / (e^{βε} Σ_{a'} e^{β⟨O_{a'}⟩_{s,θ}}) ≤ e^{β⟨Õ_a⟩_{s,θ}} / Σ_{a'} e^{β⟨Õ_{a'}⟩_{s,θ}} ≤ e^{βε} e^{β⟨O_a⟩_{s,θ}} / (e^{−βε} Σ_{a'} e^{β⟨O_{a'}⟩_{s,θ}})
    ⇔ e^{−2βε} π_θ(a|s) ≤ π̃_θ(a|s) ≤ e^{2βε} π_θ(a|s).                             (7)
Hence,
    TV(π_θ, π̃_θ) = Σ_a |π̃_θ(a|s) − π_θ(a|s)|
                 ≤ Σ_a ( e^{2βε} π_θ(a|s) − e^{−2βε} π_θ(a|s) )
                 = (e^{2βε} − e^{−2βε}) Σ_a π_θ(a|s)
                 = 2|sinh(2βε)| = 4βε + O((βε)^3) as βε → 0^+.
Using a similar approach to the proof of the previous section, we show the following lemma:
Lemma 3. For a SOFTMAX-PQC policy π_θ defined by a unitary U(s, θ) and observables O_a, call
∂_i⟨Õ_a⟩_{s,θ} approximations of the true derivatives ∂_i⟨O_a⟩_{s,θ} with at most ε additive error, and ⟨Õ_a⟩_{s,θ}
approximations of the true expectation values ⟨O_a⟩_{s,θ} with at most ε' = ε (4β max_a ‖O_a‖)^{−1}
additive error. Then the approximate log-policy gradient ∇_θ log π̃_θ(a|s) = β ( ∇_θ ⟨Õ_a⟩_{s,θ} −
Σ_{a'} π̃_θ(a'|s) ∇_θ ⟨Õ_{a'}⟩_{s,θ} ) has distance O(βε) to ∇_θ log π_θ(a|s) in ℓ∞-norm.
Proof. Call x_{a,i} = π_θ(a|s) ∂_i⟨O_a⟩_{s,θ} and x̃_{a,i} = π̃_θ(a|s) ∂_i⟨Õ_a⟩_{s,θ}, such that:
    ∂_i log π̃_θ(a|s) = β ( ∂_i⟨Õ_a⟩_{s,θ} − Σ_{a'} x̃_{a',i} ).
We also have
    Σ_a x_{a,i} = Σ_a π_θ(a|s) ∂_i⟨O_a⟩_{s,θ} ≤ max_{a,i} |∂_i⟨O_a⟩_{s,θ}| ≤ max_a ‖O_a‖,
where the last inequality derives from the parameter-shift rule (Eq. (5)) formulation of ∂_i⟨O_a⟩ for
derivatives w.r.t. rotation angles of the PQC, and from the fact that the ∂_i⟨O_a⟩ are simply expectation
values ⟨H_{a,i}⟩ with ‖H_{a,i}‖ ≤ ‖O_a‖ for observable weights.
Applying the triangle inequality to the right-hand side of Eq. (8), we hence have:
    |Σ_a x_{a,i} − Σ_a x̃_{a,i}| ≤ ε + 4βε' max_a ‖O_a‖ + O((βε')^2 ε) + O((βε')^3)   as βε' → 0^+.
D Environment specifications and hyperparameters
In Table 1, we present a specification of the environments we consider in our numerical simulations.
These are standard benchmarking environments from the OpenAI Gym library [24], described in Ref.
[49], PQC-generated environments that we define in Sec. 4.2, and the CognitiveRadio environment
of Ref. [20] that we discuss in Appendix E.
Environment | State dimension | Number of actions | Horizon | Reward function | Termination conditions
MountainCar-v0 | 2 | 3 | 200 | −1 + height until termination | Reaching goal or horizon
Table 1: Environments specifications. The reward function of Mountaincar-v0 has been modified
compared to the standard specification of OpenAI Gym [24], similarly to Ref. [48].
In Tables 2 and 3, we list the hyperparameters used to train our agents on the various environments
we consider. All agents use an ADAM optimizer. For the plots presented in this manuscript, all
quantum circuits were implemented using the Cirq library [61] in Python and simulated using a
Qulacs backend [62] in C++. For the tutorial [36], the TensorFlow Quantum library [37] was used.
All simulations were run on the LEO cluster (more than 3000 CPUs) of the University of Innsbruck,
with an estimated total compute time (including hyperparametrization) of 20 000 CPU-hours.
In Fig. 6, we run a similar experiment to that of Sec. 3.2 in the main text, but on RAW-PQC agents
instead of SOFTMAX -PQC agents. We observe that both increasing the depth of the PQCs and
training the scaling parameters λ have a similar positive influence on the learning performance, one
even more pronounced than for SOFTMAX-PQC agents. Nonetheless, we also observe that, even
at greater depth, the final performance, as well as the speed of convergence, of RAW-PQC agents
remain limited compared to that of SOFTMAX -PQC agents.
[Figure 6 plots: average collected rewards vs. episode for RAW-PQC agents on CartPole, MountainCar and Acrobot; curves compare the reference depths (1, 4, 2), increased depths (5, 6, 5), and fixed scaling parameters λ.]
Figure 6: Influence of the model architecture for RAW-PQC agents. The blue curves in each plot
correspond to the learning curves from Fig. 3 and are taken as a reference.
Environment | Model | Learning rates | Depth | Width | Discount γ | Final β | Batch size
Table 2: Hyperparameters 1/2. For PQC policies, we choose 3 distinct learning rates [α_φ, α_w, α_λ]
for the rotation angles φ, observable weights w and scaling parameters λ, respectively. For SOFTMAX-
PQCs, we take a linear annealing schedule for the inverse-temperature parameter β, starting from 1
and ending at the final β. The batch size is counted in the number of episodes used to evaluate the
gradient of the value function. Depth indicates the number of encoding layers D_enc for PQC policies,
or the number of hidden layers for a DNN policy. Width corresponds to the number of qubits n on
which a PQC acts (also equal to the dimension d of the environment's state space), or the number of
units per hidden layer for a DNN.
Environment | Model | Observables | Number of params. | Entang. topology | Train entang. | Baseline
Table 3: Hyperparameters 2/2. We call an entangling layer a layer of 2-qubit gates in the PQC. Circular
and all-to-all topologies of entangling layers are equivalent for n = 2 qubits, so we call them one-
to-one in that case. When trained, entangling layers are composed of R_zz = e^{−iθ(Z⊗Z)/2} rotations;
otherwise, they are composed of Ctrl-Z gates. For policies with 2 actions, the same observable, up
to a sign change, is used for both actions. Z_i refers to a Pauli-Z observable acting on qubit i, while
P_{i..j} indicates a projection on basis states i to j. In the experiments of Sec. 3.2, when the weights
of the SOFTMAX-PQC are kept fixed, the observables used for MountainCar-v0 and Acrobot-v1
are [Z_0, Z_0 Z_1, Z_1], and those used for CartPole-v1 are [Z_0 Z_1 Z_2 Z_3, −Z_0 Z_1 Z_2 Z_3]. The different
numbers of parameters in a given row correspond to the different depths in that same row in Table 2.
E.2 Shape of the policies learned by PQCs vs. DNNs
In CartPole-v1 The results of Sec. 3 demonstrate that our PQC policies can be trained to good
performance in benchmarking environments. To get a feel for the solutions found by our agents, we
compare the SOFTMAX -PQC policies learned on CartPole to those learned by standard DNNs (with
a softmax output layer), which are known to easily learn close-to-optimal behavior on this task. More
specifically, we look at the functions learned by these two models, prior to the application of the
softmax normalization function (see Eq. (3)). Typical instances of these functions are depicted in
Figure 8. We observe that, while DNNs learn simple, close-to-piecewise-linear functions of their
input state space, PQCs naturally tend to learn highly oscillating functions that are more prone to
instability. While the results of Schuld et al. [42] already indicated that such highly oscillating
functions are natural for PQCs, it is noteworthy that these are also the type of functions
naturally learned in a direct-policy RL scenario. Moreover, our enhancements to standard PQC
classifiers show how to make these highly oscillating functions more amenable to real-world tasks.
In PQC-generated environments Fig. 9 shows results analogous to those of Fig. 5 in the main text, but
with two different random initializations of the environment-generating PQC. Both confirm our
observations. In Fig. 10, we compare the policies learned by prototypical SOFTMAX-PQC and DNN
agents in these PQC-generated environments. We observe that the typical policies learned by DNNs
are rather simple, with up to 2 (or 3) regions, delimited by close-to-linear boundaries, as opposed to
the policies learned by SOFTMAX -PQCs, which delimit red from blue regions with wide margins.
These observations highlight the inherent flexibility of SOFTMAX -PQC policies and their suitability
to these PQC-generated environments, as opposed to the DNN (and RAW-PQC) policies we consider.
In a related work on value-based RL with PQCs, the authors of Ref. [20] introduced the CognitiveRa-
dio environment as a benchmark to test their RL agents. In this environment, the agent is presented at
each interaction step with a binary vector (e.g., (0, 0, 0, 1, 0)) of size n that describes the occupation of n
radio channels. Given this state, the agent must select one of the n channels as its communication
channel, so as to avoid collisions with occupied channels (a ±1 reward reflects these collisions). The
authors of Ref. [20] consider a setting where, in any given state, only one channel is occupied and
its assignment changes periodically over time steps, for an episode length of 100 steps. While this
constitutes a fairly simple task environment with discrete state and action spaces, it allows testing the
performance of PQC agents on a family of environments described by their system size n and making
claims on the parameter complexity of the PQCs as a function of n. To reproduce the findings of
Ref. [20] in a policy-gradient setting, we test the performance of our SOFTMAX-PQC agents on this
environment. We find numerically (see Fig. 7) that they achieve a performance very similar to that of
the PQC agents of Ref. [20] on the same system sizes considered there (n = 2 to 5), using PQCs with
the same scaling of the number of parameters, i.e., O(n).
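For reference, a minimal sketch of such an environment as we understand it from Ref. [20]; the exact periodic assignment schedule and the class interface are assumptions made for illustration.

```python
import numpy as np

class CognitiveRadioEnv:
    """At each step exactly one of n channels is occupied; the agent picks a channel and
    receives +1 if it is free and -1 on a collision; the occupied channel shifts
    periodically and episodes last 100 steps."""

    def __init__(self, n_channels=4, horizon=100):
        self.n, self.horizon = n_channels, horizon

    def _state(self):
        s = np.zeros(self.n)
        s[self.occupied] = 1.0          # binary occupation vector shown to the agent
        return s

    def reset(self):
        self.t, self.occupied = 0, 0
        return self._state()

    def step(self, action):
        reward = -1.0 if action == self.occupied else 1.0
        self.t += 1
        self.occupied = self.t % self.n   # assumed cyclic re-assignment of the occupied channel
        done = self.t >= self.horizon
        return self._state(), reward, done
```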
[Figure 7 plots: average collected rewards for n = 2, 3, 4 and 5 channels.]
Figure 7: Performance of our SOFTMAX -PQC agents on the CognitiveRadio environment pro-
posed in Ref. [20]. Average performance of 20 agents for system sizes (and number of qubits) n = 2
to 5.
[Figure 8 plots, panels (a) and (b): 3D surfaces of unnormalized probability as functions of pole angle, cart velocity and pole angular velocity.]
Figure 8: Prototypical unnormalized policies learned by SOFTMAX -PQC agents and DNN
agents in CartPole. Due to the 4 dimensions of the state space in CartPole, we represent the
unnormalized policies learned by (a) SOFTMAX -PQC agents and (b) DNN agents on 3 subspaces of
the state space by fixing unrepresented dimensions to 0 in each plot. To get the probability of the
agent pushing the cart to the left, one should apply the logistic function (i.e., 2-dimensional softmax)
1/(1 + exp(−z)) to the z-axis values of each plot.
[Figure 9 panels: (a) PQC labeling function, (b) SL-PQC, (c) Cliffwalk-PQC.]
[Figure 10 panels (a)–(d): learned policies over the 2D input space, with axes ranging over [0, 2π] and [0, π].]
Figure 10: Prototypical policies learned by SOFTMAX -PQC agents and DNN agents in PQC-
generated environments. All policies are associated to the labeling function of Fig. 9.d. Policies
(a) and (b) are learned in the SL-PQC environment while policies (c) and (d) are learned in the
Cliffwalk-PQC environment.
F Supervised learning task of Liu et al.
Define p a large prime number, n = ⌈log₂(p − 1)⌉, and g a generator of ℤ*_p = {1, 2, . . . , p − 1} (i.e.,
a g ∈ ℤ*_p such that {g^y, y ∈ ℤ_{p−1}} = ℤ*_p). The DLP consists in computing log_g x on input x ∈ ℤ*_p.
Based on DLP, Liu et al. [14] define a concept class C = {f_s}_{s ∈ ℤ_{p−1}} over the input space X = ℤ*_p,
where each labeling function of this concept class is defined as follows:
    f_s(x) = +1, if log_g x ∈ [s, s + (p−3)/2],
             −1, otherwise.                                                          (9)
Each function f_s : ℤ*_p → {−1, 1} hence labels half of the elements in ℤ*_p with the label +1 and the other
half with the label −1. We refer to Figure 1 in Ref. [14] for a good visualization of all these objects.
The performance of a classifier f is measured in terms of its testing accuracy
    Acc_f(f_s) = Pr_{x∼X}[f(x) = f_s(x)].
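For illustration, a short Python sketch (our own, not taken from Ref. [14]) that samples labeled examples from this concept class by drawing the exponent y directly, so that log_g x = y is known to the data generator; the interval is taken modulo p − 1, which is what makes exactly half of the inputs +1.

```python
import random

def dlp_dataset(p, g, s, num_samples, rng=random):
    """Sample (x, f_s(x)) pairs: draw y uniformly from Z_{p-1}, set x = g^y mod p,
    and label +1 iff y = log_g x lies in [s, s + (p-3)/2] (indices modulo p - 1)."""
    half_width = (p - 3) // 2
    data = []
    for _ in range(num_samples):
        y = rng.randrange(p - 1)
        x = pow(g, y, p)
        label = 1 if (y - s) % (p - 1) <= half_width else -1
        data.append((x, label))
    return data

# toy example (far from cryptographic sizes): p = 11 with generator g = 2
samples = dlp_dataset(p=11, g=2, s=3, num_samples=5)
```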
G Proof of Theorem 1
In the following, we provide constructions of a) fully random, b) partially random and c) fully
deterministic environments satisfying the properties of Theorem 1. We consider the three families of
environments separately and provide individual lemmas specifying their exact separation properties.
Fully random: the SL-DLP environment. This result is near-trivially obtained by noting that any
classification problem can be easily mapped to a (degenerate) RL problem. For this, the environment
will be an MDP defined as follows: its state space is the input space of the classification problem,
its action space comprises all possible labels, rewards are trivially +1 for assigning a correct label
to an input state and −1 otherwise, and the initial and next-state transition probabilities are state-
independent and equal to the input distribution of the classification task. The optimal policy of this
MDP is clearly the optimal classifier of the corresponding SL task. Consider now the classification
task of Liu et al., defined in detail in Appendix F: the input distribution is taken to be uniform on the
state space, i.e., P(s_t) = 1/|S|, and the performance of a classifier f w.r.t. a labeling (or ground truth)
function f* is measured in terms of a testing accuracy
    Acc_f(f*) = (1/|S|) Σ_s Pr[f(s) = f*(s)].                                         (10)
For the MDP associated to this classification task and length-1 episodes of interaction, the value
function of any policy π(a|s) is given by
    V_π(s_0) = (1/|S|) Σ_{s_0} ( π(f*(s_0)|s_0) − π(−f*(s_0)|s_0) )
             = (1/|S|) Σ_{s_0} ( 2π(f*(s_0)|s_0) − 1 )
             = 2 Acc_π(f*) − 1,
which is trivially related to the testing accuracy of this policy on the classification task. Note that we
also have Vrand (s0 ) = 0 and Vopt (s0 ) = 1.
Since these observations hold irrespective of the labeling function f*, we can show the following
result:
Lemma 4 (Quantum advantage in SL-DLP). There exists a uniform family of SL-DLP MDPs, each
derived from a labeling function f ∗ of the DLP concept class C (see Appendix F), for which classical
hardness and quantum learnability holds. More specifically, the performance of any classical learner
is upper bounded by 1/poly(n), while that of a class of quantum agents is lower bounded by 0.98
with probability above 2/3 (over the randomness of their interaction with the environment and noise
in their implementation).
testing accuracy Accπ (f ∗ ) ≤ 1/2 + 1/poly(n), and hence its value function would be Vπ (s0 ) ≤
1/poly(n).
For quantum learnability, we define an agent that first collects poly(n) random length-1 interactions
(i.e., a random state s_0 and its associated reward for the action +1, from which the label f*(s_0) can be
inferred), and uses Theorem 2 of Liu et al. to train a classifier that has test accuracy at least 0.99 with
probability at least 2/3 (this process can be repeated O(log δ^{−1}) times to increase this probability
to 1 − δ via majority voting). This classifier has a value function V_π(s_0) ≥ 0.98.
Note that this proof trivially generalizes to episodes of interaction of length greater than 1, when
preserving the absence of temporal correlations in the states experienced by the agents. For episodes of
length H, the only change is that the value function of any policy, and hence the bounds we achieve,
get multiplied by a factor of (1 − γ^H)/(1 − γ) for a discount factor γ < 1, and by a factor H for γ = 1.
Partially random: the Cliffwalk-DLP environment. One major criticism to the result of Lemma
4 is that it applies to a very degenerate, fully random RL environment. In the following, we show
that similar results can be obtained in environments based on the same classification problem, but
while imposing more temporal structure and less randomness (such constructions were introduced
in Ref. [50], but for the purpose of query separations between RL and QRL). For instance, one can
consider cliffwalk-type environments, inspired by the textbook “cliff walking” environment of Sutton
& Barto [44]. This class of environments differs from the previous SL-DLP environments in its state
and reward structure: in any episode of interaction, experienced states follow a fixed “path” structure
(that of the cliff) for correct actions, and a wrong action leads to immediate “death” (negative reward
and episode termination). We slightly modify this environment into a “slippery scenario” in which,
with a δ probability, any action may lead to a uniformly random position on the cliff. This additional
randomness allows us to prove the following separation:
Lemma 5 (Quantum advantage in Cliffwalk-DLP). There exists a uniform family of Cliffwalk-DLP MDPs with arbitrary slipping probability δ ∈ [0.86, 1] and discount factor γ ∈ [0, 0.9], each derived from a labeling function f* of the DLP concept class C, for which classical hardness and quantum learnability hold. More specifically, the performance of any classical learner is upper bounded by V_rand(s_0) + 0.1, while that of a class of quantum agents is lower bounded by V_opt(s_0) − 0.1 with probability above 2/3 (over the randomness of their interaction with the environment and noise in their implementation). Since V_rand(s_0) ≤ −1/2 and V_opt = 0, we always have a classical-quantum separation.
of this policy in any episode of length k + 1 with an initial state s0 ∈ {x0 , l}. Since the testing state
xk is the only state to be rewarded, we can already note that Vπ (x0 ) = π(f ∗ (xk )|T, xk ), that is, the
probability of the policy correctly labeling the testing state xk after having experienced the training
states T . Also, since s0 ∈ {x0 , l} and Vπ (l) = 0, we have Vπ (x0 ) ≥ Vπ (s0 ).
With this construction, we obtain the following result:
Lemma 6 (Quantum advantage in Deterministic-DLP). There exists a uniform family of
Deterministic-DLP POMDPs (exponentially many instances for a given concept fs of the DLP
classification problem) where:
1) (classical hardness) if there exists a classical learning agent which, when placed in a randomly
chosen instance of the environment, has value Vc (s0 ) ≥ 1/2 + 1/poly(n) (that is, 1/poly(n) better
than a random agent), with probability at least 0.845 over the choice of environment and the randomness of its learning algorithm, then there exists an efficient classical algorithm to solve DLP,
2) (quantum learnability) there exists a class of quantum agents that attains a value Vq (s0 ) = 1 (that
is, the optimal value) with probability at least 0.98 over the choice of environment and randomness
of the learning algorithm.
• SL-DLP and Deterministic-DLP are the two closest environments to the DLP classification task of Liu et al. While the value function in SL-DLP is trivially equivalent to the accuracy on the classification problem, we find the value function in Deterministic-DLP to be a weaker figure of merit than this accuracy. Namely, a high accuracy trivially leads to a high value, while a high (or non-trivial) value does not necessarily lead to a high (or non-trivial) accuracy (in all these statements, high probability over the randomness of choosing the environment and of the learning algorithm is implied). This explains why the classical-hardness statement for Deterministic-DLP is weaker than in SL-DLP.
• In Cliffwalk-DLP, it is less straightforward to relate the testing accuracy of a policy to its performance on the deterministic parts of the environment, which explains why we trivially upper bound this performance by 0 on these parts. We believe, however, that these deterministic parts actually make the learning task much harder, since they strongly restrict the part of the state space the agents get to see. This claim is supported by our numerical experiments in Sec. 4.2. Also, since we showed classical hardness for fully deterministic environments, it would be simple to construct a variant of Cliffwalk-DLP where these deterministic parts are provably hard as well.
H Proof of Lemma 5
Consider a slippery cliffwalk environment defined by a labeling function f* in the concept class C of Liu et al. This cliffwalk has p − 1 states, ordered w.l.o.g. in their natural order, and correct actions (the ones that do not lead to immediate "death") f*(i) for each state i ∈ Z*_p. For simplicity of our proofs, we also consider circular boundary conditions (i.e., doing the correct action in the state p − 1 of the cliff leads to the state 1), random slipping at each interaction step to a uniformly sampled state on the cliff with probability δ > 0, an initialization of each episode in a uniformly sampled state i ∈ Z*_p, and a 0 (resp. −1) reward for doing the correct (resp. wrong) action in any given state.
The value function of any policy π, which has probability π(i) (we abbreviate π(f*(i)|i) to π(i)) of doing the correct action in state i ∈ Z*_p, is given by:
$$V_\pi(i) = \pi(i)\,\gamma\left((1-\delta)V_\pi(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j)\right) - (1-\pi(i)) \quad (11)$$
Since this environment only has negative rewards, we have that Vπ (i) ≤ 0 for any state i and policy
π, which allows us to write the following inequality:
$$V_\pi(i) \le \pi(i)\,\gamma\,\frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j) - (1-\pi(i)).$$
Averaging over the (uniformly distributed) states i ∈ Z*_p then gives
$$\frac{1}{p-1}\sum_{i=1}^{p-1}V_\pi(i) \le \left(\frac{1}{p-1}\sum_{i=1}^{p-1}\pi(i)\right)\left(\frac{\gamma\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j) + 1\right) - 1.$$
We note that the first factor is exactly the accuracy of the policy π on the classification task of Liu et al.:
$$\mathrm{Acc}_\pi(f^*) = \frac{1}{p-1}\sum_{i=1}^{p-1}\pi(i).$$
We hence have:
$$\frac{1}{p-1}\sum_{i=1}^{p-1}V_\pi(i) \le \mathrm{Acc}_\pi(f^*)\left(\frac{\gamma\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j) + 1\right) - 1.$$
Again, by noting in Eq. (11) that we have V_π(i) ≤ 0 and π(i) ≤ 1 for any policy π and state i ∈ Z*_p, we have:
$$V_\pi(i) \ge \gamma\left((1-\delta)V_\pi(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j)\right) - (1-\pi(i)).$$
We use this inequality to bound the value function at the initial state s_0:
$$V_\pi(s_0) = \frac{1}{p-1}\sum_{i=1}^{p-1}V_\pi(i) \ge \gamma\left(\frac{1-\delta}{p-1}\sum_{i=1}^{p-1}V_\pi(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_\pi(j)\right) + \frac{1}{p-1}\sum_{i=1}^{p-1}\pi(i) - 1.$$
H.3 Bounds for classical hardness and quantum learnability
We use the bounds derived in the two previous sections to prove classical hardness and quantum
learnability of this task environment, as stated in Lemma 5.
For this, we start by noting the following expression for the value function of a random policy (one that takes uniformly random actions in all states):
$$V_{\mathrm{rand}}(s_0) = \frac{\gamma}{2}\left(\frac{1-\delta}{p-1}\sum_{i=1}^{p-1}V_{\mathrm{rand}}(i+1) + \frac{\delta}{p-1}\sum_{j=1}^{p-1}V_{\mathrm{rand}}(j)\right) - \frac{1}{2} = \frac{\gamma}{2}V_{\mathrm{rand}}(s_0) - \frac{1}{2} = -\frac{1}{2-\gamma},$$
again due to the circular boundary conditions of the cliffwalk and the resulting absence of termination
conditions outside of “death".
As for the value function of the optimal policy, this is trivially Vopt = 0.
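As a numerical cross-check of this closed form (ours, for an arbitrary toy cliff size and parameters), fixed-point iteration of the Bellman equation for the random policy on the circular slippery cliffwalk converges to −1/(2 − γ):

```python
import numpy as np

p, delta, gamma = 102, 0.9, 0.8           # toy size and parameters (illustrative)
n_states = p - 1                          # cliff states 1, ..., p-1

V = np.zeros(n_states)
for _ in range(10_000):                   # fixed-point iteration of the Bellman equation
    mean_V = V.mean()
    V_next = np.roll(V, -1)               # circular "next state on the cliff"
    # Random policy: correct action w.p. 1/2 (reward 0, move on), wrong w.p. 1/2 (reward -1, terminate).
    V = 0.5 * gamma * ((1 - delta) * V_next + delta * mean_V) - 0.5

print(V.mean())           # ~ -1 / (2 - gamma)
print(-1 / (2 - gamma))   # closed form derived above
```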
Given that g(0.51, 0.86, 0.9) ≈ 0.0995 < 0.1, we then get our desired result for δ0 = 0.86 and
γ1 = 0.9. Noting that Vπ (s0 ) − Vrand (γ) ≤ g(x, δ, γ) ≤ 0.1 from Eq. (12), we hence have classical
hardness ∀ (δ, γ) ∈ [δ0 , 1] × [0, γ1 ].
I Proof of Lemma 6
I.1 Proof of classical hardness
≥ (1 − δ)(1 − ε),
since Pr[f_c(x_k) = f*(x_k) | T, x_k] ≥ 1 − ε for instances in S_δ and Σ_{T,x_k} P(T, x_k) Pr[success | T, x_k] ≥ 1 − δ by definition.
In the following, we set 1 − ε = 0.999 and 1 − δ ≥ 0.845 (the reason for this becomes apparent below), such that:
$$\mathbb{E}_{P(T)}\!\left[\mathrm{Acc}_{f_c}(T)\right] \ge 0.844155 > \frac{5}{6} + \frac{1}{96}. \quad (15)$$
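For the reader's convenience, the arithmetic behind the two sides of Eq. (15) is:
$$(1-\delta)(1-\varepsilon) \ge 0.845 \times 0.999 = 0.844155, \qquad \frac{5}{6} + \frac{1}{96} = \frac{80}{96} + \frac{1}{96} = \frac{81}{96} = 0.84375 < 0.844155.$$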
Now, consider the following learning algorithm: given a training set T, construct a Deterministic-DLP environment that uses this T and a randomly chosen x_k, and define the classifier f_c that boosts the π(a|T, x_k) obtained by running our classical agent on this environment (as explained above). We want to show that f_c has accuracy Acc_{f_c}(T) ≥ 1/2 + 1/poly(n) with probability at least 2/3 over the choice of T and the randomness of its construction, which is sufficient to solve DLP classically.
For that, we show a stronger statement. Call 𝒯_succ the subset of all instances of training sets 𝒯 = {T} for which Acc_{f_c}(T) ≥ 1/2 + 1/poly(n). We prove by contradiction that |𝒯_succ| ≥ 2|𝒯|/3.
Assume |𝒯_succ| < 2|𝒯|/3; then
$$\mathbb{E}_{P(T)}\!\left[\mathrm{Acc}_{f_c}(T)\right] = \sum_{T} P(T)\,\mathrm{Acc}_{f_c}(T) = \frac{1}{|\mathcal{T}|}\left(\sum_{T\in\mathcal{T}_{\mathrm{succ}}}\mathrm{Acc}_{f_c}(T) + \sum_{T\notin\mathcal{T}_{\mathrm{succ}}}\mathrm{Acc}_{f_c}(T)\right)$$
$$< \frac{|\mathcal{T}_{\mathrm{succ}}|}{|\mathcal{T}|}\times 1 + \frac{|\mathcal{T}|-|\mathcal{T}_{\mathrm{succ}}|}{|\mathcal{T}|}\left(\frac{1}{2}+\frac{1}{\mathrm{poly}(n)}\right) < \frac{5}{6} + \frac{1}{3\,\mathrm{poly}(n)} < 0.844155$$
for large enough n, in contradiction with Eq. (15).
Hence, with probability at least 2/3 over the choice of training states and the randomness of the learning algorithm, our constructed classifier has accuracy Acc_{f_c}(T) ≥ 1/2 + 1/poly(n). By Theorem 8, Remark 1 of Liu et al., this is sufficient to construct an efficient classical algorithm that solves DLP.
I.2 Proof of quantum learnability
Using the learning algorithm of Liu et al., we can construct a quantum classifier that achieves accuracy Acc_q(T) ≥ 0.99 with probability at least 2/3 over the randomness of the learning algorithm and the choice of training states T, of length |T| = poly(n). Now define instead training states T of length |T| = M · poly(n), for M = O(log δ′⁻¹) (hence |T| is still polynomial in n), and use each of the M segments of T to train M independent quantum classifiers. Define f_q as a classifier that labels x_k using a majority vote on the labels assigned by each of these classifiers. This constructed classifier has accuracy Acc_{f_q}(T) ≥ 0.99 with now probability 1 − δ′ over the choice of training states T and the randomness of the learning algorithm.
We then note that, by calling "success" the event Acc_{f_q}(T) ≥ 0.99, we have:
$$\sum_{T,x_k} P(T,x_k)\Pr\!\left[V_q(x_0)=1 \mid T,x_k\right] \ge \sum_{T} P(T)\sum_{x_k}P(x_k)\Pr[\text{success}\mid T]\times\Pr\!\left[V_q(x_0)=1\mid T,x_k,\text{success}\right]$$
$$= \sum_{T} P(T)\Pr[\text{success}\mid T]\sum_{x_k}P(x_k)\times\Pr\!\left[f_q(x_k)=f^*(x_k)\mid T,x_k,\text{success}\right] = \sum_{T}P(T)\Pr[\text{success}\mid T]\,\mathrm{Acc}_{f_q}(T) \ge (1-\delta')\times 0.99,$$
which means that our constructed agent achieves a value V_q(x_0) = 1 (which also implies V_q(s_0) = 1) with probability at least (1 − δ′) × 0.99 over the choice of environment and the randomness of the learning algorithm. By setting (1 − δ′) = 0.98/0.99, we get our statement.
To understand the distinction between the quantum learners of Liu et al. and the PQC policies we are
constructing here, we remind the reader of the two models for quantum SVMs defined in Ref. [7]:
the explicit and the implicit model. Both models share a feature-encoding unitary U(x) that encodes data points x into feature states |φ(x)⟩ = U(x)|0^⊗n⟩.
In the implicit model, one first evaluates the kernel values
$$K(x_i, x_j) = |\langle\phi(x_i)|\phi(x_j)\rangle|^2 \quad (16)$$
for the feature states associated to every pair of data points {x_i, x_j} in the dataset, and then uses the resulting kernel matrix in a classical SVM algorithm. This algorithm returns a hyperplane classifier in feature space, defined by its normal vector ⟨w| = Σ_i α_i ⟨φ(x_i)| and bias b, such that the sign of |⟨w|φ(x)⟩|² + b gives the label of x.
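To illustrate the classical post-processing step of the implicit model, the following sketch (ours; random toy feature vectors stand in for quantum feature states, and the labels are arbitrary) feeds a kernel matrix of the form of Eq. (16) to a standard SVM solver with a precomputed kernel.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for quantum feature states: random complex unit vectors, one per data point.
def random_states(num, dim):
    v = rng.normal(size=(num, dim)) + 1j * rng.normal(size=(num, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

states = random_states(40, 16)
labels = np.array([1] * 20 + [-1] * 20)

# Kernel entries K(x_i, x_j) = |<phi(x_i)|phi(x_j)>|^2, as in Eq. (16);
# on a quantum device these would be estimated from measurement statistics.
K = np.abs(states.conj() @ states.T) ** 2

clf = SVC(kernel="precomputed").fit(K, labels)  # classical SVM on the precomputed quantum kernel
print(clf.predict(K[:5]))                       # kernel rows between test and training points
```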
In the explicit model, the classifier is instead obtained by training a parametrized |w_θ⟩. Effectively, this classifier is implemented by applying a variational unitary V(θ) to the feature states |φ(x)⟩ and measuring the resulting quantum states with a fixed observable, whose expectation value is |⟨w_θ|φ(x)⟩|².
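A matching linear-algebra sketch of the explicit model (ours; random unitaries stand in for U(x) and V(θ), and the |0⟩⟨0| observable anticipates the choice made below) shows how the decision value arises as an overlap:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16                                    # toy Hilbert-space dimension (4 qubits)

def random_unitary(d):
    """Haar-ish random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

U_x = random_unitary(dim)       # stand-in for the feature-encoding unitary U(x)
V_theta = random_unitary(dim)   # stand-in for the variational circuit V(theta)

zero = np.zeros(dim, dtype=complex); zero[0] = 1.0   # |0...0>

phi_x = U_x @ zero              # feature state |phi(x)>
out = V_theta @ phi_x           # state after the variational circuit

# Expectation of the fixed observable O = |0...0><0...0| on the output state:
# this equals |<0|V(theta)U(x)|0>|^2, i.e. the overlap |<w_theta|phi(x)>|^2 with <w_theta| = <0|V(theta).
expectation = np.abs(zero.conj() @ out) ** 2
print(expectation)
```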
In the following sections, we describe how the implicit quantum SVMs of Liu et al. can be transformed
into explicit models while guaranteeing that they can still represent all possible optimal policies in
the DLP environments. And in Appendix K, we show that, even under similar noise considerations as
Liu et al., these optimal policies can also be found using poly(n) random data samples.
As we just described, our classifier belongs to a family of so-called explicit quantum SVMs. It is hence described by a PQC with two parts: a feature-encoding unitary U(x), which creates the feature state |φ(x)⟩ = U(x)|0^⊗n⟩ when applied to an all-zero state, followed by a variational circuit V(θ) parametrized by a vector θ. The resulting quantum state is then used to measure the expectation value ⟨O⟩_{x,θ} of an observable O, to be defined. We rely on the same feature-encoding unitary U(x) as the one used by Liu et al., i.e., the unitary that creates feature states of the form
$$|\phi(x)\rangle = \frac{1}{\sqrt{2^k}}\sum_{i=0}^{2^k-1}\left|x\cdot g^i\right\rangle \quad (17)$$
for k = n − t log(n), where t is a constant defined later, under noise considerations. This feature state can be seen as the uniform superposition of the image (under the exponentiation s′ ↦ g^{s′}) of an interval of integers [log_g(x), log_g(x) + 2^k − 1] in log-space. Note that U(x) can be implemented in Õ(n³) operations [14].
By noting that every labeling function f_s ∈ C to be learned (see Eq. (9)) delimits two equally-sized intervals of log(Z*_p), we can restrict the decision boundaries to be learned by our classifier to be half-space-dividing hyperplanes in log-space. In feature space, this is equivalent to learning separating hyperplanes that are normal to quantum states of the form:
$$|\phi_{s'}\rangle = \frac{1}{\sqrt{(p-1)/2}}\sum_{i=0}^{(p-3)/2}\left|g^{s'+i}\right\rangle. \quad (18)$$
Noticeably, for input points x such that log_g(x) is away from certain delimiting regions around s and s + (p−3)/2, the inner product |⟨φ(x)|φ_s⟩|² is either ∆ = 2^{k+1}/(p−1) or 0, whenever x is labeled +1 or −1 by f_s, respectively. This hence leads to a natural classifier, assuming overlaps of the form |⟨φ(x)|φ_{s'}⟩|² can be measured:
$$h_{s'}(x) = \begin{cases} 1, & \text{if } |\langle\phi(x)|\phi_{s'}\rangle|^2/\Delta \ge 1/2,\\ -1, & \text{otherwise,}\end{cases} \quad (19)$$
which has an (ideal) accuracy 1 − ∆ whenever s′ = s.
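The two-valued structure of these overlaps, and the classifier of Eq. (19), can be checked directly on a toy instance; the sketch below (ours; p = 23 and g = 5 are an arbitrary small prime and generator, and the states are written out as vectors rather than prepared by circuits) builds |φ(x)⟩ and |φ_{s'}⟩ as in Eqs. (17)–(18).

```python
import numpy as np

p, g = 23, 5                     # toy prime and generator (5 is a primitive root mod 23)
n_states = p - 1                 # elements of Z_p^* are 1, ..., p-1

# Discrete logs: dlog[x] = log_g(x) for x in Z_p^*
dlog = {pow(g, e, p): e for e in range(n_states)}

def phi_x(x, k):
    """Feature state of Eq. (17): uniform superposition over x*g^0, ..., x*g^(2^k - 1)."""
    v = np.zeros(n_states)
    for i in range(2 ** k):
        v[(x * pow(g, i, p)) % p - 1] = 1.0
    return v / np.sqrt(2 ** k)

def phi_s(s):
    """Hyperplane state of Eq. (18): uniform superposition over g^s, ..., g^(s + (p-3)/2)."""
    v = np.zeros(n_states)
    for i in range((p - 1) // 2):
        v[pow(g, s + i, p) - 1] = 1.0
    return v / np.sqrt((p - 1) / 2)

k, s = 2, 7
Delta = 2 ** (k + 1) / (p - 1)
for x in range(1, p):
    overlap = np.dot(phi_x(x, k), phi_s(s)) ** 2
    label = 1 if overlap / Delta >= 0.5 else -1          # classifier h_{s'} of Eq. (19)
    print(x, dlog[x], round(overlap, 3), label)          # overlap is Delta or 0 away from the boundary
```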
To complete the construction of our PQC classifier, we should hence design the composition of its variational part V(θ) and measurement O such that they result in expectation values of the form ⟨O⟩_{x,θ} = |⟨φ(x)|φ_{s′}⟩|². To do this, we note that, for |φ_{s′}⟩ = V̂(s′)|0⟩, the following equality holds:
$$|\langle\phi(x)|\phi_{s'}\rangle|^2 = \left|\left\langle 0^{\otimes n}\right|\hat{V}(s')^\dagger U(x)\left|0^{\otimes n}\right\rangle\right|^2 = \mathrm{Tr}\!\left[\,\left|0^{\otimes n}\right\rangle\!\left\langle 0^{\otimes n}\right|\,\rho(x,s')\right]$$
where ρ(x, s′) = |ψ(x, s′)⟩⟨ψ(x, s′)| is the density matrix of the quantum state |ψ(x, s′)⟩ = V̂(s′)†U(x)|0^⊗n⟩. Hence, an obvious choice of variational circuit is V(θ) = V̂(s′), combined with a measurement operator O = |0^⊗n⟩⟨0^⊗n|. Due to the similar nature of |φ_{s′}⟩ and |φ(x)⟩, it is possible to use an implementation for V̂(s′) that is similar to that of U(x) (take x = g^{s′} and k ≈ n/2).³ We also note that, for points x such that log_g(x) is (p − 1)∆/2 away from the boundary regions of h_{s′}, the non-zero inner products |⟨φ(x)|φ_{s′}⟩|² are equal to ∆ = O(n^{−t}). These can hence be estimated efficiently to additive error, which allows us to efficiently implement our classifier h_{s′} (Eq. (19)).
The noise has the effect that some points which would be correctly classified by the noiseless classifier now have a non-zero probability of being misclassified. To limit the overall decrease in classification accuracy, we focus on limiting the probability of misclassifying points x_i such that log_g(x_i) is (p − 1)∆/2 away from the boundary points s′ and s′ + (p−3)/2 of h_{s′}. We call I_{s′} the subset of Z*_p comprised of these points. For points in I_{s′}, the probability of misclassification is that of having |e_{i,s′}| ≥ ∆/2. We can use Chebyshev's inequality to bound this probability:
$$\Pr\!\left[|e_{i,s'}| \ge \frac{\Delta}{2}\right] \le \frac{4}{\Delta^2 R} \quad (20)$$
since E[e_{i,s′}] = 0 and Var[e_{i,s′}] ≤ 1/R.
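A quick simulation (ours; the overlap value, shot count and number of trials are arbitrary) of the R-shot frequency estimate behind Eq. (20) shows the empirical tail sitting well below the Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(1)
Delta, R, trials = 0.04, 10_000, 20_000    # arbitrary toy values

# e is the deviation of the R-shot frequency estimate from its mean Delta.
estimates = rng.binomial(R, Delta, size=trials) / R
e = estimates - Delta

empirical = np.mean(np.abs(e) >= Delta / 2)
chebyshev = 4 / (Delta ** 2 * R)           # right-hand side of Eq. (20), using Var <= 1/R
print(empirical, chebyshev)                # empirical tail is far below the bound
```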
The proof of this Theorem is detailed below.
Given a training set X ⊂ Z*_p, polynomially large in n (i.e., |X| = n^c), define the training loss
$$L(s') = \frac{1}{2|X|}\sum_{x\in X}\left|h_{s'}(x) - f_s(x)\right|$$
and its noisy analog
$$\widetilde{L}(s') = \frac{1}{2|X|}\sum_{x\in X}\left|\widetilde{h}_{s'}(x) - f_s(x)\right|.$$
Our optimization algorithm goes as follows: using the noisy classifier h̃_{s′}, evaluate the loss function L̃(log_g(x)) for each candidate variational parameter g^{s′} = x ∈ X, then set
$$g^{s'} = \operatorname*{argmin}_{x\in X}\widetilde{L}(\log_g(x)).$$
This algorithm is efficient in the size of the training set, since it only requires |X|² evaluations of h̃_{s′}.
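A schematic sketch of this training loop on the same kind of toy instance as above (ours; the prime, shot count, training-set size and the hidden s are arbitrary, and the R-shot overlap estimate is simulated by binomial sampling) is:

```python
import numpy as np

rng = np.random.default_rng(2)
p, g, k, R = 23, 5, 2, 2000                # toy prime, generator, interval exponent, shots (all illustrative)
dlog = {pow(g, e, p): e for e in range(p - 1)}

def interval_state(start_log, length):
    """Uniform superposition (as a real vector over Z_p^*) of g^start, ..., g^(start+length-1)."""
    v = np.zeros(p - 1)
    for i in range(length):
        v[pow(g, start_log + i, p) - 1] = 1.0
    return v / np.sqrt(length)

phi = lambda x: interval_state(dlog[x], 2 ** k)            # |phi(x)>, Eq. (17)
phi_s = lambda s: interval_state(s, (p - 1) // 2)          # |phi_s'>, Eq. (18)
Delta = 2 ** (k + 1) / (p - 1)

def noisy_classifier(x, s_prime):
    """h~_{s'}(x): threshold an R-shot estimate of |<phi(x)|phi_s'>|^2, as in Eq. (19)."""
    q = float(np.dot(phi(x), phi_s(s_prime)) ** 2)
    estimate = rng.binomial(R, q) / R
    return 1 if estimate / Delta >= 0.5 else -1

s_true = 7                                                  # hidden concept parameter
f_s = lambda x: 1 if (dlog[x] - s_true) % (p - 1) < (p - 1) // 2 else -1

X = [int(x) for x in rng.choice(np.arange(1, p), size=10, replace=False)]   # training set

def noisy_loss(s_prime):
    return sum(abs(noisy_classifier(x, s_prime) - f_s(x)) for x in X) / (2 * len(X))

# Grid search over the |X| candidate parameters g^{s'} = x, x in X.
best = min(X, key=lambda x: noisy_loss(dlog[x]))
print("recovered s' =", dlog[best], " true s =", s_true)
```

On such toy instances the argmin typically lands on a training point whose discrete log is close to s, which is what Lemma 8 below formalizes.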
To prove Theorem 3, we first show that we can enforce argmin_{x∈X} L̃(log_g(x)) = argmin_{x∈X} L(log_g(x)) with high probability (Lemma 7), and second, that this algorithm also leads to an s′ close to the optimal s in log-space with high probability (Lemma 8).
Lemma 7. For a training set of size n^c such that c ≥ log_n(8/δ), a t ≥ 3c in the definition of ∆, and a number of circuit evaluations per inner product R ≥ 4n^{2(t+c)}/δ, we have
$$\Pr\!\left[\operatorname*{argmin}_{x\in X}\widetilde{L}(\log_g(x)) = \operatorname*{argmin}_{x\in X}L(\log_g(x))\right] \ge 1 - \frac{\delta}{2}.$$
Proof. In order for the minima of the two losses to be obtained for the same x ∈ X, it is sufficient to ensure that the classifiers h_{log_g(x_i)} and h̃_{log_g(x_i)} agree on all points x_j, for all x_i, x_j ∈ X. This can be enforced by having:
$$\bigcap_{\substack{i,j\\ i\neq j}}\left\{x_i \in I_{\log_g(x_j)}\right\} \;\cap\; \bigcap_{i,s'}\left\{|e_{i,s'}| \le \frac{\Delta}{2}\right\},$$
that is, having for all classifiers h_{log_g(x_j)} that all points x_i ∈ X, x_i ≠ x_j, are away from its boundary regions in log-space, and that the labels assigned to these points are all the same under noise.
We bound the probability of the negation of this event:
$$\Pr\!\left[\,\bigcup_{\substack{i,j\\ i\neq j}}\left\{x_i \notin I_{\log_g(x_j)}\right\} \cup \bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right] \le \Pr\!\left[\,\bigcup_{\substack{i,j\\ i\neq j}}\left\{x_i \notin I_{\log_g(x_j)}\right\}\right] + \Pr\!\left[\,\bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right],$$
where the first probability is bounded, via the union bound, by
$$\sum_{\substack{i,j\\ i\neq j}}\frac{\Delta}{2} \le \frac{n^{2c}\Delta}{2}.$$
By setting t ≥ 3c, we have ∆ ≤ 4n^{−t} ≤ 4n^{−3c}, which allows us to bound this first probability by δ/4 when c ≥ log_n(8/δ).
As for the second probability above, we have
$$\Pr\!\left[\,\bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right] \le \sum_{i,s'}\Pr\!\left[|e_{i,s'}| \ge \frac{\Delta}{2}\right] \le \frac{4n^{2c}}{\Delta^2 R}$$
using the union bound and Eq. (20). By setting R ≥ 4n^{2(t+c)}/δ ≥ 16n^{2c}/(∆²δ) (since ∆ ≥ 2n^{−t}), we can bound this second probability by δ/4 as well, which gives:
$$\Pr\!\left[\operatorname*{argmin}_{x\in X}\widetilde{L}(\log_g(x)) = \operatorname*{argmin}_{x\in X}L(\log_g(x))\right] \ge 1 - \Pr\!\left[\,\bigcup_{\substack{i,j\\ i\neq j}}\left\{x_i \notin I_{\log_g(x_j)}\right\} \cup \bigcup_{i,s'}\left\{|e_{i,s'}| \ge \frac{\Delta}{2}\right\}\right] \ge 1 - \delta/2.$$
Lemma 8. For a training set of size n^c such that c ≥ log_n(log(δ/2)/log(1 − 2ε)), the parameter s′ = log_g(argmin_{x∈X} L(log_g(x))) is within ε distance of the optimal s with probability:
$$\Pr\!\left[\frac{|s'-s|}{p-1} \le \varepsilon\right] \ge 1 - \frac{\delta}{2}.$$
From here, we notice that, when we apply Corollary 1 for ε′ ≤ ∆/2, our optimization algorithm returns an s′ such that, with probability 1 − δ, the set I_{s′} is equal to I_s and is of size (p − 1)(1 − 2∆). In the event where |s′ − s|/(p − 1) ≤ ε′ ≤ ∆/2, we can hence bound the accuracy of the noisy classifier:
$$\mathrm{Acc}_{\widetilde{h}_{s'}}(f_s) = \frac{1}{p-1}\sum_{x\in\mathbb{Z}_p^*}\Pr\!\left[\widetilde{h}_{s'}(x) = f_s(x)\right] \ge \frac{1}{p-1}\sum_{x\in I_s}\Pr\!\left[\widetilde{h}_{s'}(x) = f_s(x)\right]$$
$$\ge (1-2\Delta)\min_{x_i\in I_s}\Pr\!\left[|e_{i,s'}| \le \frac{\Delta}{2}\right] \ge (1-2\Delta)\left(1 - \frac{4}{\Delta^2 R}\right) = 1 - 2\Delta\left(1 - \frac{4}{\Delta^2 R}\right) - \frac{4}{\Delta^2 R}$$
with probability 1 − δ. We now set t ≥ max{3 log_n(8/δ), log_n(16/ε)}, ε′ = n^{−t} and R ≥ max{4n^{2(t+c)}/δ, 128/ε³}, such that 2ε′ = 2n^{−t} ≤ ∆ ≤ 4n^{−t} ≤ ε/4, 1 − 4/(∆²R) ≤ 1 and 4/(∆²R) ≤ ε/2.
Using these inequalities, we get
$$\mathrm{Acc}_{\widetilde{h}_{s'}}(f_s) \ge 1 - \varepsilon.$$