Sparks of Quantum Advantage and Rapid Retraining in Machine Learning
William Troy1*
1Independent Researcher
*Corresponding Author: [email protected]
Abstract
The advent of quantum computing holds the potential to revolutionize various fields by
solving complex problems more efficiently than classical computers. Despite this promise,
practical quantum advantage is hindered by current hardware limitations, notably the small
number of qubits and high noise levels. In this study, we leverage adiabatic quantum computers
to optimize Kolmogorov-Arnold Networks, a powerful neural network architecture for
representing complex functions with minimal parameters. By modifying the network to use
Bézier curves as the basis functions and formulating the optimization problem into a Quadratic
Unconstrained Binary Optimization problem, we create a fixed-sized solution space, independent
of the number of training samples. Our approach demonstrates sparks of quantum advantage
through faster training times compared to classical optimizers such as Adam, Stochastic
Gradient Descent, Adaptive Gradient, and simulated annealing. Additionally, we introduce a
novel rapid retraining capability, enabling the network to be retrained with new data without
reprocessing old samples, thus enhancing learning efficiency in dynamic environments.
Experimental results on initial training of classification and regression tasks validate the efficacy
of our approach, showcasing significant speedups and comparable performance to classical
methods. Experiments on retraining demonstrate a sixty-fold speedup using adiabatic
quantum computing based optimization compared to the gradient descent based
optimizers, with theoretical models suggesting this speedup could be even larger. Our findings
suggest that with further advancements in quantum hardware and algorithm optimization,
quantum-optimized machine learning models could have broad applications across various
domains, with initial focus on rapid retraining.
1. Introduction
The advent of quantum computing (QC) promises to revolutionize various fields by
solving complex problems more efficiently than classical computers. This is possible as quantum
computers leverage the principles of superposition and entanglement to perform computations
that would be infeasible for classical computers, potentially offering exponential speedups for
certain types of problems1–3. For instance, Shor's algorithm for factoring large numbers can
theoretically break widely used cryptographic systems much faster than the best classical
algorithms4. Despite this promise, the practical realization of quantum advantage is hindered by
current limitations in quantum hardware, notably the relatively small number of qubits and high
noise levels on modern quantum processors. These limitations make it challenging to solve
large-scale problems and require highly optimized algorithms to make the most of the existing
quantum resources5. To date, quantum advantage has only been demonstrated a couple of times,
notably by Google’s Sycamore processor and in Gaussian boson sampling experiments6,7.
While these achievements mark significant milestones, quantum advantage in other
domains remains difficult to achieve. A particularly promising area of application for QC is
quantum machine learning (QML), which integrates quantum computing with machine learning
techniques to leverage quantum speedups for training and inference tasks8–10. Platforms like
TensorFlow and Qiskit have already started integrating quantum models, providing tools for
researchers to explore QML applications11,12. Despite these advancements, Google recently
highlighted a significant gap in the field at Google I/O 2024, stating that no one has yet
demonstrated a clear quantum advantage for machine learning on classical data13. This gap
underscores the need for innovative approaches that can bridge this divide and showcase the
practical benefits of quantum computing in machine learning.
In this study, we address this challenge by leveraging adiabatic quantum computing to
optimize Bézier Kolmogorov-Arnold Networks (KANs). KANs represent a powerful neural
network architecture known for their unique ability to represent complex functions with a
relatively small number of parameters14,15. However, due to the complexities of the model, the
optimization of KANs is currently quite slow compared to that of multi-layer perceptrons. By
reformulating the optimization problem from a classical backpropagation algorithm to a
Quadratic Unconstrained Binary Optimization (QUBO) problem, we are able to create a
fixed-size solution space that is independent of the number of training samples. Using D-Wave
quantum annealers as our adiabatic quantum computers we are then able to find the optimal
weights for this network where the number of qubits and run time on the quantum computer is
independent of the number of training samples.
Our approach demonstrates sparks of quantum advantage when testing on small models
which are able to fit within the currently limited number of qubits available. This demonstration
is done by achieving faster training times compared to classical optimizers. These classical
optimizers include one of the most popular optimization strategies, the Adam optimizer16–20, as
well as Stochastic Gradient Descent21 (SGD), Adaptive Gradient22 (AdaGrad), and the classical
parallel to quantum annealing, simulated annealing23,24. Moreover, we
introduce a novel rapid retraining capability, enabled by the ability to retrain with new data
without the need to reprocess old samples. This feature is particularly advantageous in dynamic
environments where data is continuously evolving, such as real-time analytics, due to the fact
that a quantum annealer can perform an entire optimization in tens of microseconds25.
2. Results
We conducted a series of experiments to evaluate the performance of our Bézier-KANs
optimized using different methods: backpropagation with the continuous variable gradient
descent based optimizers via pytorch26, QUBO with simulated annealing, and QUBO with
quantum annealing. Our results cover five separate tasks which include two classification tasks
with 1-layer KANs and three regression tasks using multi-layer KANs. Additionally, we explore
the rapid retraining capability of our KAN formulation and its impact on training speed. Tasks
chosen remain relatively simplistic due to the extremely limited number of quality qubits
available in current hardware.
Figure 1. The training and testing metrics of Bézier-KANs optimized using classical optimizers and quantum
annealing on (a) the circle dataset and (b) the moon dataset. (c) The respective training times for these different
optimization scenarios.
Figure 2. (a) MSE and (b) R-Squared values of Bézier-KANs trained on regression dataset 1 using classical
optimizers and hybrid quantum annealing. (c) MSE and (d) R-Squared values of Bézier-KANs trained on regression
dataset 2 using an Adam optimizer, simulated annealing, and pure quantum annealing. (e) Time taken for initial
training and retraining using our different optimizers on regression dataset 1 and (f) on regression dataset 2. The
initial training had 1,000,000 data points while each retraining added 100,000 new data points.
It should be mentioned that the hybrid quantum annealer used here finds solutions in
around three seconds, as opposed to the tens of microseconds of a pure quantum annealer. It was
necessary to use a hybrid quantum annealer for some of our regression tasks as they use a greater
number of variables and hybrid quantum annealers can more consistently solve for a larger
number of variables than current pure quantum annealers can. Despite the large time increase
introduced to the quantum solutions by using a hybrid annealer, our hybrid method is able to
perform all four rounds of iterative retraining in an eighth of the time taken to perform a single
round of training by the gradient descent based optimizers for regression dataset 1. On top of this,
even the simulated annealer dramatically outperforms the gradient descent based
optimizers on these tasks, as it does not have to reprocess already seen samples. This is taken
a step further on regression dataset 2, where the pure quantum annealer is able to perform all four
rounds of retraining in less than 2% of the time taken by the gradient descent based optimizers to
perform the same task.
Regression dataset 3, given by Eq. 12, involved a more complex KAN with a larger
number of binary variables, over 200. Due to this large number of binary variables, current
quantum annealers struggled to encode the task well and converge, while the simulated annealer
managed to find a solution if allowed enough runs. To probe this, we varied the degree of one
of the Bézier curves on the bottom layer from one to two, leaving the rest of the network
untouched, with the other bottom Bézier curve and the top Bézier curve having degrees of two and
one, respectively, and performed 50 hybrid optimizations for each of the two modified networks
(Figure 3).
Figure 3. Distribution of (a) MSE and (b) R-Squared values for 100 hybrid optimizations of a Bézier-KAN, where
the degree of a Bézier curve in the first layer was changed from 1 to 2.
From this it can be seen that, against what one would expect, increasing the flexibility
of the model to learn by increasing the degrees it has access to can actually negatively affect
the model by increasing the chance of finding a less optimal solution. Although this is
unexpected, we attribute it to the current state of quantum annealers, in which the number of
variables is limited while both noise and chain breaks are very prevalent. For comparison,
simulated annealing was able to find a solution with an MSE value of 0.0041 and R-Squared of
0.963 when the Bézier curve being modified had a degree of one.
3. Discussion
Despite the limitations in the number and quality of current qubits, our work
demonstrates that it is already possible to create machine learning frameworks with training
speeds that outperform current classical methods in specific scenarios while maintaining
comparable performance. This advantage is particularly evident when retraining networks with
additional data, with pure quantum annealing, hybrid quantum annealing, and simulated
annealing all performing retraining faster than the gradient descent based optimizers, and pure
quantum annealing having the fastest retraining times. This success is achieved by creating
fixed-sized solution spaces with states that can be saved and rapidly solved using an annealer.
We anticipate further improvements in rapid retraining speed as advancements in quantum
hardware allow pure quantum annealers to accurately solve more complex problems.
In theory we should be seeing even larger speedups. This is because all that is needed for
our quantum optimization is a forward pass of the network before quantum annealing. Given that
theoretical annealing runs in a fraction of a second regardless of the problem size, the main time
component of the quantum optimization framework is the singular forward pass. Comparing this
to classical back propagation algorithms that require both a forward and backward pass for each
training epoch, it is easy to see that we should be seeing much greater experimental speedups. As
our experiments require 50 to 150 epochs of classical training, even without including the time
complexity of gradient calculations in backwards passes, one would expect to see speedups on
the scale of the number of training epochs.
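As a rough back-of-the-envelope estimate (treating the annealing time as negligible and counting only forward and backward passes), the expected speedup takes the form

\text{speedup} \approx \frac{E\,(T_{\text{fwd}} + T_{\text{bwd}})}{T_{\text{fwd}}} \geq E

where E is the number of classical training epochs (50 to 150 in our experiments), and T_{\text{fwd}} and T_{\text{bwd}} denote the cost of one forward and one backward pass, respectively.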
We attribute this discrepancy between theoretical speedups and what we see
experimentally to code optimization. Given that data preprocessing constitutes a vast majority of
the runtime in most of our experiments (Supplementary Material S1), we anticipate that further
optimization of our code will lead to experimental speedups closer to the theoretical ones.
There is also significant room for improvement of this algorithm with additional research
focused on optimizing the quantum annealing process, reducing the number of auxiliary
variables, problem decomposition, implementing similar approaches on gate-based quantum
computers29, and exploring the continuous optimization space, which is known to be simpler than
that of the discrete space30. Perhaps the most promising area of all for advancing this
methodology will be all-to-all connected quantum annealers31,32. All-to-all connectivity in
quantum annealers promises to drastically reduce, if not eliminate, the need for auxiliary
variables when solving HUBO problems33. This would theoretically eliminate the scalability
issue of our algorithm caused by the increasing need for auxiliary variables as the number of
layers in a network increases, allowing for a one-to-one qubit to variable representation of the
entire system.
With these things in mind, we foresee this as just the first of a new class of
quantum-optimizable machine learning models that create fixed-sized solution spaces. Exploring
alternative KAN formulations using other spline functions, Taylor series expansions of spline
basis functions, other objective functions, or entirely different machine learning models could
lead to the development of more models that fit within this class.
As potential applications of this model type span various fields where rapid
retraining can have a large effect, including finance, healthcare, environmental monitoring, and
cybersecurity, there is a large incentive to expand on and optimize our proposed model type.
Along with this, although we already demonstrate experimental speedups in the training time of
our model compared to classical optimization, continued advancements in quantum computing
will naturally enhance the performance of our model, allowing for a potential future in which
complex machine learning networks could theoretically be trained and retrained on quantum
computers at speeds dramatically faster than their classical counterparts are capable of.
4. Methods
KANs are a neural network architecture designed to represent complex functions with a
minimal number of parameters. In order to achieve this representation, KANs build networks
that are composed of basis functions. The original KAN representation used B-spline basis
functions summed with weighted activation functions to manipulate edges of a network.
However, it has also been shown that many other types of basis functions including
Fourier-splines34 and wavelets15 can also work to create KANs. While we are able to use our end
formulation to optimize one-layer KANs built from many different basis functions, including B-splines,
Fourier-splines, and Gaussian radial basis function splines (Supplementary Material S2), these
spline types are not able to fit into the QUBO framework when stacked. This is because they are
composed of either non-differentiable base functions when fully expanded or functions whose
variables cannot be expressed in linear and quadratic terms alone. While this is not a
problem for traditional optimization methods, as we will see, it becomes a significant issue when
optimizing a multi-layered KAN using QUBO methods.
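For illustration, a minimal forward pass of a small Bézier-KAN might look like the following sketch; the [2,1,1] shape, the clipping of the hidden value, and all numbers are assumptions for demonstration only, not the exact networks used in this work.

```python
import numpy as np
from math import comb

def bezier(t, control_points):
    """Evaluate a Bezier curve of degree len(control_points)-1 at t in [0, 1]."""
    m = len(control_points) - 1
    return sum(comb(m, i) * (1 - t) ** (m - i) * t ** i * p
               for i, p in enumerate(control_points))

def kan_forward(x, y, layer1_cps, layer2_cps):
    """Toy forward pass of a [2, 1, 1] Bezier-KAN:
    one edge curve per input, summed, then fed through a top curve."""
    hidden = bezier(x, layer1_cps[0]) + bezier(y, layer1_cps[1])
    # Inputs to each curve are assumed to be normalized to [0, 1].
    return bezier(np.clip(hidden, 0.0, 1.0), layer2_cps)

# Example: degree-2 curves on the bottom layer, degree-1 curve on top.
z = kan_forward(0.3, 0.7,
                layer1_cps=[[0.0, 0.2, 0.5], [0.1, 0.4, 0.6]],
                layer2_cps=[0.0, 1.0])
print(z)
```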
A QUBO problem takes the form of minimizing, over binary variables q_i ∈ {0, 1},

f(\vec{q}) = \sum_{i=1}^{N} Q_{i,i} q_i + \sum_{i<j}^{N} Q_{i,j} q_i q_j    (2)

where Q_{i,i} and Q_{i,j} denote the diagonal and off-diagonal terms, respectively, of a QUBO matrix35.
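As a minimal illustration (not the solver configuration used in this work), a QUBO of this form can be posed as a coefficient dictionary and handed to D-Wave's simulated annealing sampler; the coefficient values below are made up.

```python
import neal  # D-Wave's simulated annealing sampler (dwave-neal)

# QUBO as a dictionary: diagonal entries are the linear terms Q_ii,
# off-diagonal entries are the quadratic couplings Q_ij (illustrative values).
Q = {
    (0, 0): -1.0,
    (1, 1): -1.0,
    (0, 1):  2.0,
}

sampler = neal.SimulatedAnnealingSampler()
sampleset = sampler.sample_qubo(Q, num_reads=100)

best = sampleset.first
print(best.sample, best.energy)  # lowest-energy bit assignment found
```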
These binary variables in our case represent the control points in our basis functions that
make up the KANs. These continuous control points can be made into binary variables using
multiple forms of binary discretization, commonly in radix-2 form:
x_i \approx \sum_{l=-m}^{m} 2^{l} q_{i,l}    (3)
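A small sketch of this discretization is shown below; the precision m and the greedy encoder are assumptions for demonstration, not the exact encoding used in this study.

```python
import numpy as np

def radix2_decode(bits, m):
    """Reconstruct a continuous value from binary variables q_l weighted 2**l,
    l = -m, ..., m, as in Eq. (3)."""
    exponents = range(-m, m + 1)
    return sum(b * 2.0 ** l for b, l in zip(bits, exponents))

def radix2_encode(x, m):
    """Greedy approximation of x with the same weights (illustration only)."""
    bits = []
    remainder = x
    for l in range(m, -m - 1, -1):        # largest weight first
        weight = 2.0 ** l
        bit = 1 if remainder >= weight else 0
        bits.append(bit)
        remainder -= bit * weight
    return list(reversed(bits))           # reorder to l = -m, ..., m

m = 3
x = 0.6
bits = radix2_encode(x, m)
print(bits, radix2_decode(bits, m))       # approximate reconstruction of 0.6
```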
While QUBO problems are ideal for quantum annealing, formatting a KAN
optimization problem as one presents several unique challenges. The first difficulty is that the
variables of interest must be expressible in only linear or quadratic terms. The
second difficulty lies in the need to fully expand the basis functions of a KAN into their
component terms. This expansion is crucial as the entire KAN can be thought of as one large
equation with each layer feeding into the next. This leads to a problem when functions used to
make a layer are discontinuous and reliant on one knowing the output of the previous layer.
Unlike traditional neural network optimization, where the current output of a layer is known
because weights are iteratively adjusted to minimize the loss function via gradient descent, QUBO does
not initialize the weights before solving. As QUBO solves for all weights at the same time, on
what can be thought of as a black box, this approach eliminates the iterative modification of
variables which is so time consuming in backpropagation. However, this QUBO formulation
brings the extra requirements that variables be expressed in linear or quadratic terms and that
all functions involved be continuously differentiable.
A Bézier curve of degree m is given by

B(t) = \sum_{i=0}^{m} \binom{m}{i} (1 - t)^{m-i} t^{i} P_i

where P_i are the control points and t \in [0, 1]. To formulate this into a QUBO format, the
continuous P variables are then discretized using variations of radix-2 encoding. This makes the
theoretical number of logical qubits needed to solve the problem directly related to the number of
control points in a given problem and the number of binary variables used to represent each
control point.
4.3. Auxiliary Variables and HUBO Formulation
From this, one can see how this may lead to variable interactions beyond the quadratic
order, the maximum order that a QUBO formulation can solve for, creating a Higher Order
Unconstrained Binary Optimization (HUBO) problem (Supplementary Material S3). While this
may seem like a major problem, one can get around it by introducing auxiliary variables,
with a single variable being used to represent the product of two or more others. An auxiliary
variable that represents two other variables can be given by:
V_{AUX} = P_1 P_2    (6)
This relation is enforced by adding penalty terms to the QUBO problem of the form:
w\left(P_1 P_2 - 2 P_1 V_{AUX} - 2 P_2 V_{AUX} + 3 V_{AUX}\right)

where w is the penalty coefficient37, set in this study to 15-25 times the maximum coefficient value in our
QUBO matrix.
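A minimal sketch of this reduction, using the standard Rosenberg-style penalty (which may differ in detail from the exact form used in this work), could look like the following; the specific cubic term, auxiliary index, and weight are illustrative.

```python
from collections import defaultdict

def add_cubic_term(qubo, i, j, k, coeff, aux, w):
    """Replace coeff * q_i * q_j * q_k by coeff * q_aux * q_k plus the standard
    penalty w * (q_i q_j - 2 q_i q_aux - 2 q_j q_aux + 3 q_aux), which is zero
    exactly when q_aux = q_i * q_j and positive otherwise."""
    qubo[(aux, k)] += coeff
    qubo[(i, j)] += w
    qubo[(i, aux)] += -2 * w
    qubo[(j, aux)] += -2 * w
    qubo[(aux, aux)] += 3 * w

qubo = defaultdict(float)
# Example: reduce 1.5 * q0 * q1 * q2, using q3 as the auxiliary variable.
# Penalty weight chosen ~15-25x the largest QUBO coefficient, per the text.
add_cubic_term(qubo, 0, 1, 2, coeff=1.5, aux=3, w=25.0)
print(dict(qubo))
```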
This makes the total number of logical qubits needed to optimize a network equal to the
number of control points multiplied by the number of binary variables used to represent each
control point plus the number of auxiliary variables:
N_{qubits} = num_{control\_vars} \times num_{binary\_per\_var} + num_{aux\_vars}
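In code, this count is simple arithmetic; the numbers below are placeholders for illustration only.

```python
def logical_qubits(num_control_points, bits_per_point, num_aux_vars):
    """Logical qubits = control points x bits per point + auxiliary variables."""
    return num_control_points * bits_per_point + num_aux_vars

# e.g. 20 control points encoded with 4 binary variables each, plus 30 aux vars
print(logical_qubits(20, 4, 30))  # 110 logical qubits
```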
The objective function we minimize is the mean squared error (MSE):

MSE = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2    (8)

where n is the number of training samples, y_i is the desired output at sample i, and \hat{y}_i is the
network output. This objective function is a perfectly valid one and is commonly used in
machine learning; however, if optimized to its fullest extent it can easily lead to overfitting of the
training data. Since we cannot check for overfitting between training iterations, as we can
classically, we re-formulate the objective function so that it pays attention to a validation dataset
on its own:
OBJ = \frac{1}{n_t} \sum_{t=1}^{n_t} \left(y_t - \hat{y}_t\right)^2 + \frac{\lambda}{n_v} \sum_{v=1}^{n_v} \left(y_v - \hat{y}_v\right)^2    (10)
with n_t and n_v denoting the number of training and validation samples, respectively, and λ
representing the validation dataset weighting.
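A minimal numpy sketch of this combined objective is shown below; the data and the value of λ are placeholders.

```python
import numpy as np

def objective(y_train, yhat_train, y_val, yhat_val, lam=1.0):
    """Training MSE plus lambda-weighted validation MSE, as in Eq. (10)."""
    train_term = np.mean((y_train - yhat_train) ** 2)
    val_term = np.mean((y_val - yhat_val) ** 2)
    return train_term + lam * val_term

rng = np.random.default_rng(0)
y_t, y_v = rng.random(1000), rng.random(200)
print(objective(y_t, y_t + 0.01, y_v, y_v + 0.05, lam=0.5))
```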
z = 2\left(1 + x^{2} + y^{2}\right)    (12)
The dataset inputs were normalized from 0 to 1 so that inputs are within the preferable
range of a Bézier curve. Gradient descent based optimization was done using pytorch with a
modified version of the original KAN library28. For all test cases involving the gradient descent
based optimizers, only results with the optimal learning rate and number of steps were reported
based optimizers, only results with the optimal learning rate and number of steps were reported
in the main text. The optimal gradient descent based learning rates were found by iteratively
testing different learning rates for each problem from 0.001 to 1.5, with the ‘best’ denoting the
learning rate that converged on the most optimal solution first. From this, the optimal number of
steps was found by stopping learning once the loss function of the best learning rate converged
for multiple training iterations.
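A simplified sketch of this normalization and learning-rate sweep is shown below; train_kan is a hypothetical stand-in for the modified KAN training routine, and the sweep keeps the lowest converged loss rather than reproducing the exact "first to converge" criterion described above.

```python
import numpy as np

def minmax_normalize(x):
    """Scale inputs to [0, 1], the preferable input range of a Bezier curve."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def sweep_learning_rates(train_kan, rates=(0.001, 0.01, 0.1, 0.5, 1.0, 1.5)):
    """Try each learning rate and keep the one whose final loss is lowest.
    `train_kan(lr)` is a hypothetical callable returning the converged loss."""
    results = {lr: train_kan(lr) for lr in rates}
    best_lr = min(results, key=results.get)
    return best_lr, results

# Usage (with a dummy stand-in for the real training routine):
best_lr, losses = sweep_learning_rates(lambda lr: abs(np.log10(lr) + 1))
print(best_lr, losses)
```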
For annealing, the objective problem formulation was done in C++ called from Python,
which then used the pyqubo library38,39 to convert the OBJ to a QUBO matrix. Simulated
annealing was done using the D-Wave simulated annealing library, while quantum annealing used
the D-Wave Advantage2 prototype 2.3 as a pure quantum annealer and hybrid binary quadratic
model version 2 as a hybrid quantum annealer. Although quantum annealing can be run in tens of
microseconds, the large amount of quantum noise currently present in quantum systems requires
hundreds to thousands of quantum annealing runs to be done per problem. This is reflected in our
experimental timings and makes the time taken by quantum annealing go from tens of
microseconds to around a tenth of a second for our test cases. Finally, since all
optimization methods tested are highly amenable to parallel computing, this factor was
eliminated and all classical computations were done on a single thread of an Intel Xeon Scalable
Platinum 8173M processor running at a 2 GHz clock speed.
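The annealing pipeline described above can be sketched end to end as follows; the objective here is a toy expression rather than the expanded Bézier-KAN OBJ, and the commented-out samplers follow Ocean SDK conventions, which may differ from the exact solver versions used in this study.

```python
import neal
from pyqubo import Array

# Binary variables standing in for discretized control points (illustrative size).
q = Array.create('q', shape=8, vartype='BINARY')

# A toy quadratic objective; in practice this comes from the expanded Bezier-KAN OBJ.
H = sum((q[i] - q[i + 1]) ** 2 for i in range(7)) + 0.5 * q[0] * q[7]

model = H.compile()
qubo, offset = model.to_qubo()

# Simulated annealing (classical baseline); num_reads reflects the hundreds to
# thousands of reads per problem mentioned above.
sampleset = neal.SimulatedAnnealingSampler().sample_qubo(qubo, num_reads=1000)
print(sampleset.first.sample, sampleset.first.energy + offset)

# Pure quantum annealing (requires D-Wave Leap access):
# from dwave.system import DWaveSampler, EmbeddingComposite
# qpu = EmbeddingComposite(DWaveSampler())
# sampleset = qpu.sample_qubo(qubo, num_reads=1000)

# Hybrid solver for larger variable counts:
# from dwave.system import LeapHybridSampler
# sampleset = LeapHybridSampler().sample(model.to_bqm())
```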
Conflict of interest:
The authors declare that they have no conflict of interest.
References:
1. Georgopoulos, K. Quantum Computing: Understanding its Potential Applications. PECB Insights
https://insights.pecb.com/quantum-computing-understanding-potential-applications/ (2023).
2. Grover, L. K. & Sengupta, A. M. Classical analog of quantum search. Phys. Rev. A 65, 032319 (2002).
3. A new quantum algorithm for classical mechanics with an exponential speedup.
http://research.google/blog/a-new-quantum-algorithm-for-classical-mechanics-with-an-exponential-spe
edup/.
4. Shor, P. W. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a
doi:10.1038/d41586-023-04007-0.
505–510 (2019).
8. Date, P., Arthur, D. & Pusey-Nazzaro, L. QUBO formulations for training machine learning models.
209–212 (2019).
https://doi.org/10.48550/arXiv.2405.08810 (2024).
13. Chou, C. & Lucero, E. Quantum computing: Facts, fiction and the future. (2024).
https://doi.org/10.48550/arXiv.2404.19756 (2024).
https://arxiv.org/abs/2405.12832v2 (2024).
16. Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. Preprint at
https://doi.org/10.48550/arXiv.1412.6980 (2017).
17. Yuan, W. & Gao, K.-X. EAdam Optimizer: How $\epsilon$ Impact Adam. Preprint at
https://doi.org/10.48550/arXiv.2011.02150 (2020).
18. Brantner, B. Generalizing Adam to Manifolds for Efficiently Training Transformers. Preprint at
https://doi.org/10.48550/arXiv.2305.16901 (2023).
19. Ahn, K., Zhang, Z., Kook, Y. & Dai, Y. Understanding Adam Optimizer via Online Learning of
20. Gudla, S. P. K. & Bhoi, S. K. A Study on Effect of Learning Rates Using Adam Optimizer in
LSTM Deep Intelligent Model for Detection of DDoS Attack to Support Fog Based IoT Systems. in
21. Bottou, L. Large-Scale Machine Learning with Stochastic Gradient Descent. in Proceedings of
22. Duchi, J., Hazan, E. & Singer, Y. Adaptive Subgradient Methods for Online Learning and
671–680 (1983).
25. Pelofske, E. Comparing Three Generations of D-Wave Quantum Annealers for Minor Embedded
26. Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. in
27. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat.
29. Glos, A., Krawiec, A. & Zimborás, Z. Space-efficient binary optimization for variational quantum
30. Liu, Z. et al. How Do Adam and Training Strategies Help BNNs Optimization. in Proceedings of
31. Lechner, W., Hauke, P. & Zoller, P. A quantum annealing architecture with all-to-all connectivity
32. Pita-Vidal, M., Wesdorp, J. J. & Andersen, C. K. Blueprint for all-to-all connected
33. Yarkoni, S., Raponi, E., Bäck, T. & Schmitt, S. Quantum annealing for industry applications:
35. Jun, K. QUBO formulations for a system of linear equations. Results Control Optim. 14, 100380
(2024).
36. O'Malley, D. & Vesselinov, V. V. ToQ.jl: A high-level programming language for D-Wave
machines based on Julia. in 2016 IEEE High Performance Extreme Computing Conference (HPEC)
37. Jun, K. & Lee, H. HUBO formulations for solving the eigenvalue problem. Results Control
38. Zaman, M., Tanahashi, K. & Tanaka, S. PyQUBO: Python Library for Mapping Combinatorial
39. Tanahashi, K., Takayanagi, S., Motohashi, T. & Tanaka, S. Application of Ising Machines and a
S1:
Figure S1. The training times of the quantum algorithm split up by step within the algorithm, these steps being
preprocessing, quantum annealing, and post-processing. Preprocessing includes formulation of the objective function,
the forward pass, and variable separation. The datasets depicted include (a) the timing of algorithm steps in the
circle and moon datasets, used in Figure 1, and (b) regression dataset 1 from Figure 2 of the main text. As can be
seen, as the problem gets more intricate the preprocessing time becomes a majority of the quantum algorithm's runtime. We
attribute this to our implementation of the code for formulating the objective function not being fully optimized.
S2:
Many types of functions that work for a single-layer KAN network put in QUBO
format do not work for a multi-layered KAN in QUBO format. This is due to the fact that, for
a one-layer network, at the time of doing the forward pass we know the value of all inputs to each
basis function. Implementations of the B-spline, Fourier-spline, and Gaussian-spline that are
quantum optimizable on a simulated annealer are shown in Figure S2.
Figure S2. Different spline functions trained to fit y = sin(x) by creating an MSE objective function converted
to QUBO format and solved using simulated annealing.
This becomes a problem when we start to stack layers, as the output of one layer becomes
the input to another. When this happens, the points we are solving for are either expressed as
conditionals or in terms of functions which do not allow them to be expressed as either linear or
quadratic, both of which are unacceptable for QUBO.
S3:
Given that we are optimizing a 2-layer Bézier-KAN of shape [1,1,1] and we are using the
syntax B_{i,\ell} for our Bézier curves, where i is the index within the layer and \ell is the layer, we can define
the entirety of the KAN as:
z = B_{1,2}\left(B_{1,1}\right)    (S1)
If we then say all of the Bézier curves in this KAN are of order 1, we can define each Bézier
curve as:
B_{i,\ell}(t) = (1 - t) P_{0,i,\ell} + t P_{1,i,\ell}, \quad 0 \leq t \leq 1    (S2)
If we then discretize the continuous control points using only the integer representation of radix
2, with two qubits, q, per control point, we then end up with Eq. S3 for a given Bézier curve:
B_{i,\ell}(t) = (1 - t)\left(q_{0,0,i,\ell} + 2 q_{1,0,i,\ell}\right) + t\left(q_{0,1,i,\ell} + 2 q_{1,1,i,\ell}\right), \quad 0 \leq t \leq 1    (S3)
Now, considering our objective function is MSE as defined in Eq. 8, we see that we must
square Eq. S5. As this will quickly blow up into a much larger equation, we will simplify it for
this example and just take the last two terms in Eq. S5 and square them to see that the problem
quickly becomes a higher order than quadratic, with more than two q variables being multiplied
together. Keep in mind that a binary variable squared is equal to the original binary variable:
Even in this simplistic example, where we only have three unique binary variables in Eq.
S6, variables become intertwined because everything is squared by the MSE objective function, and
auxiliary variables become necessary to bring the problem back to quadratic order. This
problem only becomes more pronounced when using higher-order Bézier curves within the
KAN, as everything being input into each Bézier curve is raised to the degree of that curve,
providing more opportunities to create a higher-order problem.
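To make this blow-up concrete, the following sympy sketch expands one squared error term of the [1,1,1] degree-1 Bézier-KAN from Eq. S3 in its binary variables and checks its degree; the target value 1/2 is an arbitrary placeholder, and the variable names only loosely mirror the subscripts above.

```python
import sympy as sp

t = sp.symbols('t')
# Four binary variables of the inner curve and two of the outer curve (cf. Eq. S3).
q = sp.symbols('q0:6')

inner = (1 - t) * (q[0] + 2 * q[1]) + t * (q[2] + 2 * q[3])      # B_{1,1}(t)
outer = (1 - inner) * q[4] + inner * q[5]                        # B_{1,2}(B_{1,1})

residual_sq = sp.expand((outer - sp.Rational(1, 2)) ** 2)        # one squared error term

# Binary variables satisfy q**2 = q, so reduce exponents before checking the degree.
reduced = residual_sq
for qi in q:
    reduced = reduced.replace(qi ** 2, qi)

degree_in_q = sp.Poly(reduced, *q).total_degree()
print(degree_in_q)   # > 2: higher than quadratic, i.e. a HUBO rather than a QUBO
```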