Proceedings of Machine Learning Research vol 201:1–23, 2023
34th International Conference on Algorithmic Learning Theory

On The Computational Complexity of Self-Attention

Feyza Duman Keles FD2135@NYU.EDU
New York University, Brooklyn, NY 10012
Pruthuvi Mahesakya Wijewardena PWIJEWARDENA@MICROSOFT.COM
Microsoft, Redmond, WA 98052
Chinmay Hegde CHINMAY.H@NYU.EDU
New York University, Brooklyn, NY 10012

Editors: Shipra Agrawal and Francesco Orabona

Abstract
Transformer architectures have led to remarkable progress in many state-of-the-art applications. However,
despite their successes, modern transformers rely on the self-attention mechanism, whose time- and
space-complexity is quadratic in the length of the input. Several approaches have been proposed
to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large
majority of these works are not accompanied by rigorous error guarantees. In this work, we establish
lower bounds on the computational complexity of self-attention in a number of scenarios. We prove
that the time complexity of self-attention is necessarily quadratic in the input length, unless the
Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention
computation is performed only approximately, and for a variety of attention mechanisms. As a
complement to our lower bounds, we show that it is indeed possible to approximate dot-product
self-attention using a finite Taylor series in linear time, at the cost of an exponential dependence
on the polynomial order.

1. Introduction
Motivation. Building upon early successes in natural language processing (Vaswani et al., 2017;
Kenton and Toutanova, 2019), transformer models now form the core of virtually every state-of-
the-art approach in numerous applications: computer vision and image understanding (Dosovitskiy
et al., 2021; Khan et al., 2021), proteomics (Jumper et al., 2021), code synthesis (Chen et al., 2021a),
and vision-language models (Radford et al., 2021; Alayrac et al., 2022). At the heart of transformer
architectures is the self-attention mechanism, which can be viewed as a trainable “layer” that takes in
as input a set of tokens (vectors), computes pairwise dot-products between (projected forms of) the
tokens, performs a softmax operation to obtain non-negative weights, and produces output tokens
using these weights.
Unfortunately, by virtue of its very definition, the standard form of self-attention requires pairwise
token operations, and therefore incurs quadratic running time in terms of the number of input tokens.
This poses serious computational challenges not just in terms of the training cost of such models,
but even just for inference (forward passes) through transformer models. The community has long
acknowledged this fact, and numerous approximate forms of self-attention that break this quadratic
bottleneck have been proposed. Some approaches advocate reducing complexity through windowing,
striding, or some other sparsification of the attention scores (Kitaev et al., 2020; Zaheer et al., 2020).
Others leverage ideas from hashing (Kitaev et al., 2020; Choromanski et al., 2021). Yet others

© 2023 F.D. Keles, P.M. Wijewardena & C. Hegde.



propose kernelizing the attention operation (Katharopoulos et al., 2020; Lu et al., 2021; Chen et al.,
2021b).
Empirically, all these methods certainly seem to lead to reduced running times: in some cases,
they reduce the costs from quadratic to linear. However, all of these methods incur some form of
error in the computation (compared to vanilla attention). These errors may have the undesirable (but
relatively benign) effect of a drop in accuracy, or may have more dramatic effects when fed an adversarial input.
In any case, it would be useful to clearly establish rigorous guarantees on time/accuracy tradeoffs for
the above methods, but these are rare. The primary theoretical question that we ask is as follows:

What are the fundamental computational tradeoffs involved in self-attention?

Our contributions. In this paper, we pursue a path complementary to most previously published
work in this area. Somewhat surprisingly, we are able to establish (conditional) quadratic lower
bounds on the running time of self-attention in a large variety of settings. This quadratic barrier holds
even if we relax the self-attention operation and allow for additive, or multiplicative, errors in the
computation. We also prove quadratic — specifically, rectangular — lower bounds even if we allow
for windowing or striding. Finally, this holds even if we use kernelization (with radial basis function
kernels), which to our current knowledge, achieves the Pareto time/accuracy frontier in terms of
empirical performance among all fast algorithms for self-attention computation (Lu et al., 2021). In
Table 1, we summarize our hardness results; checkmarks indicate which combinations of self-attention
type and calculation type are proven hard in this paper.
Our results demonstrate that there may be a fundamental “no free lunch” phenomenon sitting
here: it seems unlikely that we can get (provably) sub-quadratic algorithms for self-attention that are
also (provably) near-accurate for all inputs.
Finally, while our primary contributions in this paper are mainly from the perspective of lower
bounds, we also provide some upper bounds. Specifically, we show that a finite Taylor series
approximation of the softmax function lends itself to an approximate form of self-attention that can
be computed in linear time. However, a caveat of this result is that the running time now scales
exponentially in the order of the Taylor polynomial.
Techniques. Our proofs are rather intuitive and are based on careful reductions from the Strong
Exponential Time Hypothesis (SETH). SETH-based lower bounds have attracted growing attention
from the complexity theory community, and have been used to prove hardness results for
edit distance (Backurs and Indyk, 2015), Frechet distance (Bringmann, 2014), dynamic program-
ming (Bringmann and Künnemann, 2015), among many others (Bringmann, 2021). For machine
learning problems, such lower bounds are less common; still, quadratic-time barriers based on SETH
have been proved for kernel PCA and backpropagation through dense networks (Backurs et al., 2017),
as well as nearest neighbors (Rubinstein, 2018). Our results can be viewed as an addition to this
body of work.
A direct reduction from SETH is cumbersome. So instead, mirroring (Backurs and Indyk, 2015),
we derive reductions from the Orthogonal Vectors Problem (OVP), which is known to require almost-
quadratic time assuming SETH. As intermediate waypoints we visit two adaptations of OVP: the
Thresholded Vectors Product Problem (TVPP), and the Bichromatic Hamming Close Pair (BHCP)
problem. Reductions between these problems require the construction of several “vector gadgets”, and
form the bulk of the technical difficulty of our proofs. Another subtlety lies in identifying the correct
temperature scaling in the softmax in order to achieve the reductions. See Section 4 for details.


Table 1: Summary of our hardness results on self-attention

                                                Calculation Type
  Self-Attention              Exact    Element-wise Multiplicative Approx.    Element-wise Additive Approx.
  Exponential Dot-Product       ✓                      ✓                                    ✓
  Softmax Dot-Product           ✓                      ✓                              µ = O(n^{−d})
  Window Sliding                ✓                      ✓                                    ✓
  Exponential ℓ2-Norm           ✓                      ✓                              µ = O(n^{−d})

2. Related work
Attention mechanisms and transformers. Ever since the seminal work of (Vaswani et al., 2017),
transformer architectures (with self-attention layers as their primary building blocks) have become
the cornerstone of state-of-the-art machine learning models. Therefore, a firm theoretical under-
standing about the statistical and computational tradeoffs involved in transformer-based models is of
considerable interest. Transformers have already been shown to exhibit the universal approximation
property (Yun et al., 2019), but lose expressivity unless the self-attention mechanism is accompanied
by skip connections (Dong et al., 2021). Self-attention also exhibits undesirable Lipschitz
continuity properties, but this can be fixed by pursuing kernel-like alternatives (Kim et al., 2021).
Speeding up self-attention. Our focus in this paper is on the running time of self-attention
computation. It is well-established that the (standard) definition of self-attention takes as
input a length-n sequence of tokens of dimension d, and requires O(dn^2) time to compute the output. The
quadratic dependence on n poses a challenge for very long input sequences, both from the training and
testing perspectives. Therefore, several sub-quadratic methods for evaluating attention layers have
been proposed. Approaches (such as the Reformer (Kitaev et al., 2020), Big Bird (Zaheer et al., 2020),
Linformer (Wang et al., 2020), Longformer (Beltagy et al., 2020), or routing transformers (Roy et al.,
2021)) use some combination of hashing, sparsification, or low-rank approximation to speed up the
computation of the attention scores. Other approaches involve replacing the softmax-based attention
with kernel approximations; cf. the work of (Katharopoulos et al., 2020), or the Nyströmformer
(Xiong et al., 2021). More recent works such as the Performer (Choromanski et al., 2021), Slim
(Likhosherstov et al., 2021), or RFA (Peng et al., 2021) approximate the attention computation using
random projections. Methods such as SOFT (Lu et al., 2021) or the Skyformer (Chen et al., 2021b)
propose to replace softmax operations with Gaussian kernels that can then be quickly evaluated.
Despite the large number (and diversity) of interesting algorithmic ideas involved in the above
efforts, the vast majority of these works only focus on improvement in running time; but few (if any)
theoretically characterize the error incurred by their proposed methods. Can there ever be a method
that is both fast (i.e., provably with sub-quadratic running time) as well as near-accurate (i.e., with
provably small additive or multiplicative error)? Our results show that this is unlikely to be the case.
Fine-grained complexity. Classical complexity theory has primarily focused on distinguishing
between problems that admit efficient (polynomial-time) solutions and those that do not. However,
a different (and finer) picture has begun to emerge over the last decade. In particular, the focus
has shifted towards precisely pinning down the exponent c such that a problem can be solved in
polynomial time Õ(n^c). Many of these results are conditional, and rely on reductions from popular
(but plausible) conjectures such as the Strong Exponential Time Hypothesis (SETH) (Impagliazzo
and Paturi, 2001), (Impagliazzo et al., 2001). See the relevant surveys (Indyk, 2017), (Rubinstein
and Williams, 2019), and (Bringmann, 2021) for a more concrete overview of the field. In particular,
this approach has been shown to provide conditional lower bounds on well-known problems such as
edit distance (Backurs and Indyk, 2015), Frechet distance (Bringmann, 2014), dynamic time warping
(Bringmann and Künnemann, 2015), longest common subsequence (LCS) (Abboud et al., 2015),
Hausdorff distance (Bringmann and Nusser, 2021), and string matching (Abboud et al., 2018). In the
context of machine learning and massive data analysis, reductions from SETH have been fruitfully
applied to problems such as clustering (Abboud et al., 2019), kernel PCA (Backurs et al., 2017), and
approximate nearest neighbors (Rubinstein, 2018).

3. Notations and Preliminaries


An ordered finite set of n vectors in R^d will be denoted as an n × d matrix whose rows are the
elements of the set. We use upper-case characters to denote both vector sets and matrices,
depending on the context. We use A_i to denote the i-th row of matrix A, or the i-th element of ordered
set A. We use A_{ij} to denote the element at the i-th row and j-th column of matrix A. For a positive
integer n ∈ Z^+, [n] denotes the set of all positive integers up to n.

3.1. Background on Self-Attention


The well-established Transformer model (Vaswani et al., 2017) is based on the multi-head attention
mechanism, comprising several self-attention layers running in parallel. The canonical choice of
self-attention is the softmax dot-product self-attention, defined as follows. For a given set of inputs
written as X ∈ Rn×d and trainable parameter matrices Wq ∈ Rd×dq , Wk ∈ Rd×dk , Wv ∈ Rd×dv ,
this operation first calculates the query (Q = XWq ), key (K = XWk ), and value (V = XWv )
matrices respectively. We assume that dq = dk . The size of Q and K is then n × dk , while the size
of V is n × dv . The softmax dot-product self-attention operation is defined as:

Attention(Q, K, V) = softmax( QK^T / √d_k ) · V.                                   (1)

Let us consider the case where n is very large compared to d_k, d_v. By virtue of its definition,
we might expect to incur O(n^2) time to compute this self-attention operation: 1) calculation of
S = QK^T / √d_k takes O(n^2 d_k) time, 2) exponentiation and calculation of the row sums of S takes O(n^2) time,
3) division of each element of S by the corresponding row sum takes O(n^2), and 4) multiplication
of softmax(QK^T) and V takes O(n^2 d_v) time. Therefore, the computational complexity of this naive
approach to computing self-attention scales quadratically in n.
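To make the cost breakdown concrete, here is a minimal NumPy sketch of the naive computation (function and variable names are ours, not the paper's); the n × n score matrix is materialized explicitly, which is precisely the step that the lower bounds below show cannot be avoided in sub-quadratic time for exact or near-exact computation.

```python
import numpy as np

def naive_softmax_attention(Q, K, V):
    """Naive softmax dot-product self-attention (Eq. 1), following the 4-step cost breakdown."""
    n, d_k = Q.shape
    # 1) score matrix S = QK^T / sqrt(d_k): O(n^2 d_k) time, O(n^2) memory
    S = Q @ K.T / np.sqrt(d_k)
    # 2) exponentiate and compute row sums: O(n^2)
    #    (subtracting the row-wise max is a standard numerical-stability trick; the softmax is unchanged)
    E = np.exp(S - S.max(axis=1, keepdims=True))
    row_sums = E.sum(axis=1, keepdims=True)
    # 3) divide each element by its row sum: O(n^2)
    P = E / row_sums
    # 4) multiply by V: O(n^2 d_v)
    return P @ V

# Example with n = 512 tokens and d_k = d_v = 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((512, 64))
K = rng.standard_normal((512, 64))
V = rng.standard_normal((512, 64))
out = naive_softmax_attention(Q, K, V)   # shape (512, 64)
```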
A generalized form of self-attention. While Eq. 1 is the typical way to define self-attention, we
also consider a more general form. Let f : Rd × Rd → R be a function that takes two vectors in
Rd as input. Then, the self-attention score matrix S is defined as Sij = f (Qi , Kj ) for all i, j ∈ [n].
Also, let h : Rn×n → Rn×n be some kind of normalization function. Then a more abstract definition
of self-attention can be expressed as:

Attention(Q, K, V ) = h(S) · V. (2)

In particular, for the softmax dot-product self-attention, the function f is the dot-product of the
given vectors with normalization factor √d_k, and h is the row-wise softmax function.
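As an illustration of this generalized form (a sketch in our own naming, not code from the paper), note that the score functions studied in Section 4 fold the exponential into f, so that h is either the identity or plain row-normalization of the already-exponentiated scores:

```python
import numpy as np

def generalized_attention(Q, K, V, f, h):
    """Generalized self-attention of Eq. 2: Y = h(S) V with S_ij = f(Q_i, K_j)."""
    n = Q.shape[0]
    S = np.array([[f(Q[i], K[j]) for j in range(n)] for i in range(n)])  # O(n^2) evaluations of f
    return h(S) @ V

# Score functions used in Section 4 (C is the temperature/scale parameter).
exp_dot = lambda x, y, C=1.0: np.exp(C * (x @ y))              # f(x, y) = e^{C x^T y}
rbf     = lambda x, y, C=1.0: np.exp(-C * np.sum((x - y)**2))  # f(x, y) = e^{-C ||x - y||_2^2}

identity      = lambda S: S                                    # h(M) = M
row_normalize = lambda S: S / S.sum(axis=1, keepdims=True)     # {h(M)}_ij = M_ij / sum_k M_ik

# With f = exp_dot and h = row_normalize, this recovers softmax dot-product attention
# (up to the 1/sqrt(d_k) scaling, which can be folded into C).
```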
In this paper, we show that there is no (provably) better algorithm than the naive O(n^2) approach
for calculating softmax dot-product self-attention, assuming SETH. We will also give similar
lower bounds for various approximate forms of self-attention. Finally, we will investigate the
computational complexity of generalized self-attention for more general forms of f .

3.2. SETH and OVP


Despite a remarkable amount of algorithmic effort on Boolean satisfiability (SAT) and related
problems, to date no one has found an algorithm with running time substantially faster than the exponential
O(2^n); indeed, there is no polynomial-time algorithm for SAT unless P = NP. The Strong Exponential
Time Hypothesis (SETH) (Impagliazzo and Paturi, 2001; Impagliazzo et al., 2001) can be viewed
as a strengthening of this statement: for every ϵ > 0, there is no O(2^{(1−ϵ)n})-time
algorithm that solves SAT.
As discussed above in Section 2, over the last decade SETH has been used to provide fine-grained
lower bounds for several polynomial-time problems. Many of these results use reduction from an
intermediate problem, given as follows.

Definition 1 (Orthogonal Vectors Problem (OVP)) Two sets of equal cardinality A = {a_1, . . . , a_n}
and B = {b_1, . . . , b_n} are given, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The
problem is to decide if there exists at least one pair of vectors a ∈ A and b ∈ B such that a^T b = 0.

Previous work has shown that for all ϵ > 0, no O(n^{2−ϵ})-time algorithm solves OVP for d = ω(log n)
unless SETH is false (Williams, 2005). In other words, for any ϵ > 0, OVP requires Ω(n^{2−ϵ}) time
when d = ω(log n). In this work, we will primarily give lower bounds on the computational
complexity of the self-attention mechanism by showing reductions from OVP.
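For reference, the baseline algorithm for OVP (our own sketch) simply checks all n^2 pairs in O(n^2 d) time; the statement above says that, under SETH, no algorithm can improve on this by a polynomial factor n^ϵ when d = ω(log n).

```python
import numpy as np

def ovp_bruteforce(A, B):
    """Decide OVP in O(n^2 d) time: is there a pair a in A, b in B with a^T b = 0?"""
    # A @ B.T is the n x n matrix of all pairwise dot products.
    return bool((A @ B.T == 0).any())

# Tiny example with d = 3: a_1 = (1, 0, 1) is orthogonal to b_1 = (0, 1, 0).
A = np.array([[1, 0, 1], [1, 1, 0]])
B = np.array([[0, 1, 0], [1, 1, 1]])
print(ovp_bruteforce(A, B))  # True
```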

4. Hardness of Computing Self-Attention


4.1. Adaptations of OVP
Based on the definition of generalized self-attention (Eq. 2), we focus on two main forms of self-
attention: (a) dot-product self-attention with f(x, y) = e^{C x^T y}, and (b) ℓ2 self-attention (or RBF kernel
self-attention) with f(x, y) = e^{−C·∥x−y∥_2^2}, for some temperature/scale parameter C to be specified
later, where x, y are row vectors from the matrices Q, K respectively. We provide hardness guarantees
for a variety of self-attention mechanisms built on top of these two types of self-attention. We achieve
this by deriving reductions from two fundamental problems stated in Definitions 2 and 3.

Definition 2 (Threshold Vectors Product Problem (TVPP)) Two sets with equal cardinality A =
{a_1, . . . , a_n} and B = {b_1, . . . , b_n} are given, where a_i, b_i ∈ {0, 1}^d are binary vectors for all
i ∈ [n]. The problem is to decide if there exists at least one pair a ∈ A and b ∈ B such that a^T b ≥ t
for a given threshold t ∈ [d].

Definition 3 (Bichromatic Hamming Close Pair Problem (BHCP)) Two sets with equal cardi-
nality A = {a1 , . . . , an } and B = {b1 , . . . , bn } are given, where ai , bi ∈ {0, 1}d are binary vectors
for all i ∈ [n]. The problem is to decide if there exists at least one pair of vectors a ∈ A and b ∈ B such
that ∥a − b∥2 < t.


Both TVPP and BHCP can be shown to be SETH-hard via reductions from OVP (Definition 1),
as shown in Lemmas 1, 2, and 3. The hardness of BHCP is an established result
(Backurs et al., 2017), but we provide an improved reduction from OVP to BHCP via a new problem: the
Bichromatic Hamming Far Pair problem (BHFP). In contrast to established reductions from OVP, we
avoid the additional factors of d (the dimension of the binary vectors) that are incurred during the process.

Definition 4 (Bichromatic Hamming Far Pair Problem (BHFP)) Two sets with equal cardinality
A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} are given, where a_i, b_i ∈ {0, 1}^d are binary
vectors for all i ∈ [n]. The problem is to decide if there exists at least one pair of vectors a ∈ A and b ∈ B
such that ∥a − b∥_2 ≥ t.

Before discussing the hardness of computing self-attention, we state the hardness guarantees of
TVPP, BHFP, and BHCP formally, with proofs deferred to Appendix A.

Lemma 1 Assume SETH. Then for any ϵ > 0, the computational complexity of TVPP is Ω(n^{2−ϵ})
for d = ω(log n).

Lemma 2 Assume SETH. Then for any ϵ > 0, the computational complexity of BHFP is Ω(n^{2−ϵ})
for d = ω(log n).

Lemma 3 Assume SETH. Then for any ϵ > 0, the computational complexity of BHCP is Ω(n^{2−ϵ})
for d = ω(log n).

Below, we show a series of reductions from TVPP and BHCP to several well-studied self-
attention mechanisms. We mainly modify the functions f(·) and h(·) (see Eq. 2) to reflect each
attention mechanism in our arguments. For ease of exposition, we ignore the scaling factor 1/√d_k
when computing the function f(·) in dot-product self-attention, since we only need to scale every
element of Q or K by 1/√d_k, which takes O(n d_k) time. We also note that several approaches for
efficient transformers were developed for the particular case Q = K (Kitaev et al., 2020); our
hardness results are stated for this specific instance where Q = K, and by a direct reduction the
hardness guarantees hold for general (Q, K) as well. Finally, all of the arguments remain valid for
multi-head self-attention, as proved in Appendix F.

4.2. Vector Gadgets for Reductions


Our arguments follow by showing reductions from TVPP and BHCP instances to self-attention
instances. We define a set of vector gadgets that convert input to TVPP and BHCP into inputs to
self-attention functions in the following manner.
TVPP vector gadgets. Let A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} be the sets given in TVPP, where
a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. We construct our vector gadgets in the following
way. First, create the matrix Q ∈ R^{2n×d} with rows Q_i = a_i for all i ∈ [n], and Q_{n+j} = C b_j
for all j ∈ [n], where C > 0 is a parameter which we will define in the corresponding reductions.
Then we create the matrix V ∈ R^{2n×1} by setting the first n elements to 0 and the second n elements
to 1.


BHCP vector gadgets. Let A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} be the sets given in BHCP, where
a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. We construct our vector gadgets in the following
way. First, create the matrix Q ∈ R^{2n×d} with rows Q_i = a_i for all i ∈ [n], and Q_{n+j} = b_j for
all j ∈ [n]. Then we create the matrix V ∈ R^{2n×1} by setting the first n elements to 0 and the second n
elements to 1.
The entire construction process of these vector gadgets (including the multiplications by C) takes
O(nd) time. We use the above gadgets in the following reductions.
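The gadget construction is straightforward to write down; the following NumPy sketch (our own naming) builds Q and V for both reductions, while the subsequent threshold test on Y = h(S) · V depends on the particular attention mechanism being reduced to.

```python
import numpy as np

def tvpp_gadget(A, B, C):
    """TVPP vector gadget: Q in R^{2n x d}, V in R^{2n x 1}.

    Rows 1..n of Q are the a_i; rows n+1..2n are C * b_j.
    V selects the B-half of each score row: first n entries 0, last n entries 1.
    """
    n = A.shape[0]
    Q = np.vstack([A, C * B]).astype(float)
    V = np.concatenate([np.zeros(n), np.ones(n)]).reshape(-1, 1)
    return Q, V

def bhcp_gadget(A, B):
    """BHCP vector gadget: same layout as above, but without the scaling by C."""
    n = A.shape[0]
    Q = np.vstack([A, B]).astype(float)
    V = np.concatenate([np.zeros(n), np.ones(n)]).reshape(-1, 1)
    return Q, V
```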

4.3. Hardness of Dot-Product Self-Attention


We begin with hardness results of dot-product self-attention (without softmax normalization) as a
warm-up. In Theorem 4 we show that exact self-attention, as well as element-wise multiplicative-
and additive-error approximations of self-attention, all require quadratic time, conditioned on SETH.
Theorem 4 Assume SETH. For any i, j ∈ [n], let S_{ij} = f(Q_i, Q_j) = e^{Q_i^T Q_j}, and for any matrix
M ∈ R^{n×n}, let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention mechanism. Provided
d_q = ω(log n), for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following
conditions requires Ω(n^{2−ϵ}) time.
1. Ŷ = Y (exact computation).
2. |Ŷ_{ij} − Y_{ij}| ≤ µ|Y_{ij}| for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ < 1 (multiplicative approximation).
3. |Ŷ_{ij} − Y_{ij}| ≤ µ for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ (additive approximation).

Proof Suppose that the matrices Q and V are constructed as TVPP vector gadgets described above
in Section 4.2. With this, we have Y = h(S) · V ∈ Rn×1 . Since Y is a vector, we slightly abuse
notation and define Yi as the ith element of Y . Now consider the first n elements of Y . Since
Y = SV , we have that
Y_i = Σ_{j=1}^{n} e^{C a_i^T b_j}   for any i ∈ [n].

Now we check the magnitudes of the Y_i, i ∈ [n], in order to distinguish between the true and false cases
of TVPP. It takes O(n) time to check the n values Y_i, i ∈ [n]. First we focus on the exact computation.
Consider the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [n], we have
a_i^T b_j ≤ t − 1, and e^{C a_i^T b_j} ≤ e^{C(t−1)}. Then for all l ∈ [n], Y_l ≤ n e^{C(t−1)} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have
a_i^T b_j ≥ t, and e^{C a_i^T b_j} ≥ e^{Ct}. Thus for some l ∈ [n], we have Y_l ≥ e^{Ct} := ∆.
To distinguish between the two cases, it is sufficient to have ∆ > δ, which holds when C = 2 log n.
Now let us consider multiplicative approximation error. With a µ-multiplicative factor, if there
are no pairs a ∈ A and b ∈ B with a^T b ≥ t, then for all l ∈ [n], Ŷ_l ≤ (1 + µ)Y_l ≤
(1 + µ) n e^{C(t−1)} := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for
some l ∈ [n], we have Ŷ_l ≥ (1 − µ)Y_l ≥ (1 − µ) e^{Ct} := ∆̂. In order to distinguish between the two
cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 log((1+µ)/(1−µ) · n).
Finally, we look at additive approximation error. With a µ-additive factor, if there are no pairs
a ∈ A and b ∈ B with a^T b ≥ t, then for all l ∈ [n], Ŷ_l ≤ Y_l + µ ≤ n e^{C(t−1)} + µ := δ̂. On
the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for some l ∈ [n], we have
Ŷ_l ≥ Y_l − µ ≥ e^{Ct} − µ := ∆̂. In order to distinguish between the two cases, it is sufficient to have
∆̂ > δ̂, and this inequality holds with C = 2 log(n + 2µ).
Thus, if there is an algorithm for computing self-attention up to an element-wise multiplicative
or additive error µ that runs in O(n^{2−ϵ}) time, this entire process takes at most O(n + nd + n^{2−ϵ}) =
O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}). Therefore, this algorithm decides if there exists at least one pair of
vectors a ∈ A and b ∈ B such that a^T b ≥ t in O(n^{2−ϵ}) time, which contradicts the hardness result
of TVPP (Lemma 1). This completes the proof.
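As a quick sanity check of the exact case (our own verification, not part of the paper's text): with C = 2 log n the two thresholds are separated by a factor of n, so the n values Y_1, . . . , Y_n can be tested against any cutoff between δ and ∆ in O(n) additional time.

```latex
C = 2\log n \;\Longrightarrow\;
\Delta = e^{Ct} = n^{2t}, \qquad
\delta = n\, e^{C(t-1)} = n \cdot n^{2(t-1)} = n^{2t-1}, \qquad
\frac{\Delta}{\delta} = n > 1 .
```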

The above proof is for computing the output of self-attention. As an easy consequence, we
can also show that computing the self-attention score matrix, S, with either exact or element-wise
multiplicative/additive error requires quadratic time, conditioned on SETH. The argument is similar to
the proof of Theorem 4, and is given in detail in Appendix C.

4.4. Hardness of Softmax Dot-Product Self-Attention


Now we establish hardness guarantees for computing standard softmax dot-product self-attention.
The difference from vanilla self-attention is that the function h(.) now normalizes input rows.
Discussion of additive approximation for this part is given in Appendix G.
Theorem 5 Assume SETH. For any i, j ∈ [n], let S_{ij} = f(Q_i, Q_j) = e^{Q_i^T Q_j}, and for any matrix
M ∈ R^{n×n}, let h(M) ∈ R^{n×n} be the matrix with {h(M)}_{ij} = M_{ij} / Σ_{k=1}^{n} M_{ik} for all i, j ∈ [n]. Let
Y = h(S) · V ∈ R^{n×d_v} be a self-attention. Then, provided d_q = ω(log n), for any ϵ > 0, computing
a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following conditions requires Ω(n^{2−ϵ}) time.
1. Ŷ = Y (exact).
2. |Ŷ_{ij} − Y_{ij}| ≤ µ|Y_{ij}| for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ < 1 (multiplicative approximation).

Proof The proof is technically similar to the one above. Suppose that the matrices Q and V
are constructed according to the TVPP vector gadgets described in Section 4.2. With this we have
Y = h(S) · V ∈ R^{n×1}. Consider the first n elements of Y. Since h(·) acts as the row-wise softmax
function, we have

Y_i = Σ_{j=n+1}^{2n} {h(S)}_{ij} = ( Σ_{j=1}^{n} e^{C a_i^T b_j} ) / ( Σ_{j=1}^{n} e^{a_i^T a_j} + Σ_{j=1}^{n} e^{C a_i^T b_j} ).

Again, we first focus on exact computation and consider two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [n], we
have a_i^T b_j ≤ t − 1, and e^{C a_i^T b_j} ≤ e^{C(t−1)}. For a function x/(x+y), the maximum value is achieved at
the maximum x and minimum y values. Thus, for all l ∈ [n], Y_l ≤ n e^{C(t−1)} / (n e^{C(t−1)} + n) = e^{C(t−1)} / (e^{C(t−1)} + 1) := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have
a_i^T b_j ≥ t, and e^{C a_i^T b_j} ≥ e^{Ct}. Then the row sum corresponding to that i, j pair satisfies Σ_{j=1}^{n} e^{C a_i^T b_j} ≥ e^{Ct} + (n − 1)e^0 =
e^{Ct} + (n − 1). For a function x/(x+y), the minimum value is achieved at the minimum x and maximum y
values. Thus, for some l ∈ [n], we have Y_l ≥ (e^{Ct} + (n − 1)) / (e^{Ct} + (n − 1) + n e^d) := ∆.
In order to distinguish between the two cases, it is sufficient to have ∆ > δ, which requires
e^{Ct} + (n − 1) > n e^d · e^{C(t−1)}. This holds with C = log n + d.


Next, consider multiplicative error approximation. Select the same TVPP vector gadgets, except
that this time the matrix V ∈ R^{2n×1} has its first n elements set to 1 and its second n elements set to 0. Since
|Ŷ_l − Y_l| ≤ µ|Y_l| for all l ∈ [2n], we have (1 − µ)Y_l ≤ Ŷ_l ≤ (1 + µ)Y_l. Now consider the values of
Ŷ_l, l ∈ [n], in the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [n], we have
a_i^T b_j ≤ t − 1. This means that Σ_{j=1}^{n} e^{C a_i^T b_j} ≤ n e^{C(t−1)}. For a function x/(x+y), the minimum value
is achieved at the minimum x and maximum y values. Thus, for all l ∈ [n], Y_l ≥ n / (n + n e^{C(t−1)}) =
1 / (e^{C(t−1)} + 1), which means that for all l ∈ [n], we have Ŷ_l ≥ (1 − µ)Y_l ≥ (1 − µ) / (e^{C(t−1)} + 1) := ∆.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have
a_i^T b_j ≥ t. Then the row sum corresponding to that i, j pair satisfies Σ_{j=1}^{n} e^{C a_i^T b_j} ≥ e^{Ct} + (n − 1)e^0 =
e^{Ct} + (n − 1). For a function x/(x+y), the maximum is achieved at the maximum x and minimum y
values. Thus, for some l ∈ [n], we have Y_l ≤ n e^d / (e^{Ct} + (n − 1) + n e^d), which means that for some
l ∈ [n], we have Ŷ_l ≤ (1 + µ)Y_l ≤ (1 + µ) n e^d / (e^{Ct} + (n − 1) + n e^d) := δ.
In order to distinguish between the two cases, it is sufficient to have ∆ > δ. This holds
with C = log( 2(1+µ)/(1−µ) · n ) + d. Thus, if there is an algorithm for computing self-attention up to an
element-wise multiplicative error µ that runs in O(n^{2−ϵ}) time, this entire process takes at most
O(n + nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists a
pair of vectors a ∈ A, b ∈ B such that a^T b ≥ t in O(n^{2−ϵ}) time, contradicting TVPP hardness.

Remark. For the multiplicative error approximations, we can set µ to a value arbitrarily close
to 1 with a dependence on n, i.e., µ = 1 − Θ(1/n^x) for a sufficiently large (constant) order x. For
example, with x = 2, we require C > 2 log( (2 − Θ(1/n^2)) / Θ(1/n^2) · n ) ≈ 2[log 2 + 3 log(Θ(n))]. Therefore, to
distinguish between the two cases with a multiplicative error approximation with µ = 1 − Θ(1/n^2),
we only require C = O(log n). We discuss this matter in more detail in Appendix H.

4.5. Hardness of Sliding Window Dot-Product Self-Attention


We now consider well-known less-expensive alternatives to standard self-attention. A popular
example is sliding window self-attention (Beltagy et al., 2020). Here, we evaluate an element of the
score matrix Sij only if the difference between i and j is within a fixed window size w; else we set it
to zero. This reduces the running time to O(nw), which can be small if the window size is small.
However, we show that such a rectangular complexity is unavoidable.
Theorem 6 Assume SETH. Let f : R^d × R^d → R be f(x, y) = e^{x^T y}. For a set Q of vectors
Q_1, . . . , Q_n, we define the matrix S ∈ R^{n×n} as

(S)_{ij} = f(Q_i, Q_j)  if |i − j| ≤ w/2,   and   (S)_{ij} = 0  otherwise.

Also, for any matrix M ∈ R^{n×n}, let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention.
Then for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following conditions
requires Ω((nw)^{1−ϵ}) time when d_q = ω(log w) and w = ω(d_q).
1. Ŷ = Y (exact).


2. |Ŷij − Yij | ≤ µ|Yij | for all i ∈ [n] and j ∈ [dv ] where 0 ≤ µ < 1 (multiplicative approximation).
3. |Ŷij − Yij | ≤ µ for all i ∈ [n] and j ∈ [dv ] where 0 ≤ µ (additive approximation).

Proof See Appendix B for the full proof. First, a TVPP instance with sets of size k = √(nw) is
constructed. Then sliding-window self-attention is computed over a sequence that interleaves the
points so that every (a_i, b_j) pair falls within some window, and an appropriate value matrix V is
constructed to select these pairs. If the overall self-attention can be calculated in O((nw)^{1−ϵ}) time,
then the TVPP instance can be solved in O(k^{2−ϵ}) time, which contradicts Lemma 1.

4.6. Hardness of ℓ2 - Self-Attention


We now establish hardness guarantees for computing ℓ2 -self-attention, which replaces a softmax with
an RBF kernel and is the core idea underlying both SOFT (Lu et al., 2021) and Skyformer (Chen
et al., 2021b), the current state-of-the-art in fast self-attention operations. In our argument, we adopt
a proof technique similar to the one employed by Backurs et al. (2017), who establish quadratic hardness of
kernel PCA assuming SETH. However, our proof involves a different chain of reductions: OVP →
BHFP → BHCP → kernel computation. A discussion of additive approximation for this part is given
in Appendix G.
Theorem 7 Assume SETH. For any i, j ∈ [n], let S_{ij} = f(Q_i, Q_j) = e^{−C·∥Q_i − Q_j∥_2^2}, where C is
a parameter, and for any matrix M ∈ R^{n×n}, let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a
self-attention. Then for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following
conditions requires Ω(n^{2−ϵ}) time when d_q = ω(log n).
1. Ŷ = Y (exact).
2. |Ŷ_{ij} − Y_{ij}| ≤ µ|Y_{ij}| for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ < 1 (multiplicative approximation).

Proof (Sketch). The Q and V matrices are constructed according to the BHCP vector gadgets described
in Section 4.2. With these, we have Y = h(S) · V ∈ R^{n×1}. Considering the first n elements of
Y and a suitable selection of C, our goal is to make the two cases (whether or not there is a solution to
BHCP) distinguishable. Thus, if there is an algorithm for computing self-attention up to an
element-wise multiplicative error µ that runs in O(n^{2−ϵ}) time, this algorithm decides whether there exists a
solution to BHCP in O(n^{2−ϵ}) time, which contradicts BHCP hardness (Lemma 3).
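The full argument is deferred, but an illustrative choice of C (our own working, using the BHCP gadget of Section 4.2, the fact that squared ℓ2 distances between binary vectors are integers, and assuming t^2 is an integer) goes as follows. With Y_l = Σ_{j=1}^{n} e^{−C∥a_l − b_j∥_2^2}, the bound in the first case holds for all l, and in the second case for some l:

```latex
\text{no close pair: } Y_l \le n\, e^{-C t^2} =: \delta, \qquad
\text{close pair exists: } Y_l \ge e^{-C (t^2 - 1)} =: \Delta, \qquad
\Delta > \delta \;\Longleftrightarrow\; e^{C} > n \;\Longleftrightarrow\; C > \log n .
```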

5. Polynomial Approximations of Self-Attention


We have shown (conditional) hardness results for a number of well-known self-attention mechanisms.
We conclude with some upper bounds. Given the query Q, key K, and value V matrices, we show
that when f(Q_i, K_j) is a polynomial of order p (p an integer ≥ 0) in Q_i^T K_j and h(·) is either the
identity function or row-wise normalization, one can compute self-attention in linear time. However,
now the time complexity scales exponentially with p. As a special case of this, we show that one
can approximate dot-product softmax self-attention in time linear in n by using the finite Taylor series
approximation e^x ≈ Σ_{k=0}^{p} x^k / k!. In what follows we use the fact that d_q = d_k and use d_q in both
places. Recall the dimensions of the matrices: Q ∈ R^{n×d_q}, K ∈ R^{n×d_q}, V ∈ R^{n×d_v}.
Lemma 8 shows that the product SV (without the row-wise normalization) can be computed in
time linear in n. Lemma 9 shows that the denominator for row-wise normalization can be computed in
time linear in n. Once we have the denominator for each row, h(S) · V can be computed by dividing
each row of SV by the corresponding denominator.

Lemma 8 Let p be an integer ≥ 0 and let S_{ij} = C · (Q_i^T K_j)^p, where i, j ∈ [n] and C is a constant.
Then SV can be computed in O(n d_q^p d_v) time.

Lemma 9 Let p be an integer ≥ 0. Then the row sums Σ_{ĵ=1}^{n} C · (Q_i^T K_ĵ)^p, for all i ∈ [n], where C is a
constant, can be computed in O(n d_q^p) time.

Proof sketch of Lemmas 8 and 9. When f(Q_i, K_j) = Q_i^T K_j (the case p = 1), we directly multiply the matrices K^T and
V to obtain K^T V in O(d_q n d_v) time; then multiplying Q and K^T V in O(n d_q d_v) time gives the desired matrix SV,
so the overall time is linear in n. The sum of a row of S can be computed by storing the sum of the vectors K_j, j ∈ [n],
in memory in O(n d_q) time and reusing it for each row i to compute Σ_{j=1}^{n} Q_i^T K_j = Q_i^T Σ_{j=1}^{n} K_j
in O(n d_q) time. This idea can be extended to (Q_i^T K_j)^p.
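As a concrete illustration of the p = 1 case (a sketch in our own naming), the associativity trick below computes both SV and the row sums without ever forming the n × n matrix S:

```python
import numpy as np

def linear_scores_p1(Q, K, V, C=1.0):
    """For S_ij = C * Q_i^T K_j, compute SV and the row sums of S in time linear in n."""
    # SV = C * Q (K^T V): K^T V costs O(d_q n d_v), then Q @ (K^T V) costs O(n d_q d_v).
    SV = C * (Q @ (K.T @ V))
    # Row sums: sum_j C * Q_i^T K_j = C * Q_i^T (sum_j K_j), computed in O(n d_q) total.
    row_sums = C * (Q @ K.sum(axis=0))
    return SV, row_sums
```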

Theorem 10 Let p be an integer ≥ 0. If S_{ij} = f(Q_i, K_j) is a polynomial function of order p in
Q_i^T K_j and h(·) performs row-wise normalization, then h(S) · V can be computed in O(n d_q^p d_v) time.

Proof The result is implied by Lemma 8 and Lemma 9. Let x_{ij} = Q_i^T K_j for i, j ∈ [n]. Write
the polynomial function of order p as Σ_{z=0}^{p} c_z x_{ij}^z, where the c_z, z ∈ {0, . . . , p}, are constants. Let
S^{(0)}, . . . , S^{(p)} ∈ R^{n×n} with S^{(z)}_{ij} = c_z x_{ij}^z. Now we can write S_{ij} = Σ_{z=0}^{p} S^{(z)}_{ij}, and thus SV =
Σ_{z=0}^{p} S^{(z)} V. By Lemma 8, each term of this summation can be computed in O(n d_q^z d_v) time.
Therefore the overall time complexity of computing SV is O(n d_q^p d_v) (taking the largest
exponent of d_q, which occurs at z = p).
Let s_i be the sum of the elements of the i-th row of S. What remains is computing h(S) · V. This
can be done indirectly by first computing SV and then dividing each row i of SV by s_i. If we
have precomputed each s_i, i ∈ [n], then this step takes O(n d_v) time, since we are dividing each
element of SV (∈ R^{n×d_v}) by a scalar. What remains to show is that computing s_i for all i ∈ [n] takes
O(n p d_q^p) time. Observe that

s_i = Σ_{ĵ=1}^{n} S_{iĵ} = Σ_{ĵ=1}^{n} Σ_{z=0}^{p} S^{(z)}_{iĵ} = Σ_{z=0}^{p} Σ_{ĵ=1}^{n} S^{(z)}_{iĵ} = Σ_{z=0}^{p} Σ_{ĵ=1}^{n} c_z x_{iĵ}^z.

From Lemma 9, each term of the outer summation over z can be computed in O(n d_q^z) time for all
i ∈ [n], and the overall time complexity is O(n d_q^p) (again taking the largest exponent of d_q).

Corollary 11 Let p be a non-negative integer. The p-th order polynomial approximation of dot-
product softmax self-attention using matrices Q, K, V with a finite Taylor series can be computed in
O(n d_q^p d_v) time.

Proof The result follows by substituting the constants c_z = 1/z!, for z = 0, . . . , p, in Theorem 10.
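To illustrate Corollary 11 (a sketch in our own naming; the feature-map bookkeeping mirrors the function α of Appendix D), the p-th order Taylor approximation of softmax attention can be evaluated in time linear in n by mapping each query and key row to its degree-z monomial features and reusing the trick from Lemmas 8 and 9:

```python
import numpy as np
from math import factorial

def tensor_power_rows(X, z):
    """Map each row of X (n, d) to its z-fold tensor power, flattened to (n, d**z); z = 0 gives ones."""
    n = X.shape[0]
    P = np.ones((n, 1))
    for _ in range(z):
        # row-wise outer product: (n, d**k) x (n, d) -> (n, d**(k+1))
        P = np.einsum('ni,nj->nij', P, X).reshape(n, -1)
    return P

def taylor_softmax_attention(Q, K, V, p=2):
    """p-th order Taylor approximation of softmax dot-product attention in O(n d_q^p d_v) time."""
    n, d_v = V.shape
    SV = np.zeros((n, d_v))
    s = np.zeros(n)                                    # row sums of the approximate score matrix
    for z in range(p + 1):
        c = 1.0 / factorial(z)
        Qz = tensor_power_rows(Q, z)                   # (n, d_q**z)
        Kz = tensor_power_rows(K, z)                   # (n, d_q**z)
        SV += c * Qz @ (Kz.T @ V)                      # Lemma 8: never materialize the n x n matrix
        s  += c * Qz @ Kz.sum(axis=0)                  # Lemma 9: row sums from a single summed key feature
    return SV / s[:, None]
```

Note that the truncated series can be a poor (even sign-violating) approximation of e^x for large negative scores; the corollary speaks only to running time, not to approximation quality.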


6. Conclusions
In this paper we investigate fundamental bounds on the computational complexity of self-attention.
We examine numerous state-of-the-art self-attention mechanisms, and prove quadratic (or rectangular)
lower bounds assuming the Strong Exponential Time Hypothesis (SETH). Even though a large
number of recent works have proposed fast approximations to self-attention, our results imply that it
may be difficult to overcome the quadratic runtime barrier while still retaining high accuracy. On
the positive side, we show that linear-time computation is possible if we choose the score computation
function to be a polynomial.
Our work leaves open several directions. At a high level, our theorems establish a separation between
‘exponential’ and ‘polynomial’ forms of self-attention, but a clearer picture of the landscape
would be helpful. Moreover, our results are for worst-case inputs; obtaining similar hardness results for
average-case inputs is an interesting direction. Finally, we leave the door open for the possibility of randomized
algorithms that achieve sub-quadratic complexity and are correct with high probability.

Acknowledgments
This work was supported in part by the National Science Foundation (under grants CCF-2005804 and
CCF-1801495) and USDA/NIFA (under grant 2021-67021-35329).

References
Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for lcs and
other sequence similarity measures. In 2015 IEEE 56th Annual Symposium on Foundations of
Computer Science, pages 59–78. IEEE, 2015.

Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. If the current clique algorithms
are optimal, so is valiant’s parser. SIAM Journal on Computing, 47(6):2527–2555, 2018.

Amir Abboud, Vincent Cohen-Addad, and Hussein Houdrougé. Subquadratic high-dimensional


hierarchical clustering. Advances in Neural Information Processing Systems, 32, 2019.

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel
Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language
model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.

Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time
(unless seth is false). In Proceedings of the forty-seventh annual ACM symposium on Theory of
computing, pages 51–58, 2015.

Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk
minimization: Kernel methods and neural networks. Advances in Neural Information Processing
Systems, 30, 2017.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
CoRR, abs/2004.05150, 2020.


Karl Bringmann. Why walking the dog takes time: Frechet distance has no strongly subquadratic
algorithms unless seth fails. In 2014 IEEE 55th Annual Symposium on Foundations of Computer
Science, pages 661–670. IEEE, 2014.

Karl Bringmann. Fine-grained complexity theory: Conditional lower bounds for computational
geometry. In Conference on Computability in Europe, pages 60–70. Springer, 2021.

Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems
and dynamic time warping. In 2015 IEEE 56th Annual Symposium on Foundations of Computer
Science, pages 79–97. IEEE, 2015.

Karl Bringmann and André Nusser. Translating hausdorff is hard: Fine-grained lower bounds
for hausdorff distance under translation. In 37th International Symposium on Computational
Geometry, 2021.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. CoRR, 2021a.

Yifan Chen, Qi Zeng, Heng Ji, and Yun Yang. Skyformer: Remodel self-attention with gaussian
kernel and Nyström method. Advances in Neural Information Processing Systems, 34, 2021b.

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea
Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al.
Rethinking attention with performers. In International Conference on Learning Representations,
2021.

Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure
attention loses rank doubly exponentially with depth. In International Conference on Machine
Learning, pages 2793–2803. PMLR, 2021.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image
is worth 16x16 words: Transformers for image recognition at scale. In International Conference
on Learning Representations, 2021.

Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and
System Sciences, 62(2):367–375, 2001. ISSN 0022-0000.

Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponen-
tial complexity? Journal of Computer and System Sciences, 63(4):512–530, 2001.

Piotr Indyk. Beyond p vs. np: quadratic-time hardness for big data problems. In Proceedings of the
29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 1–1, 2017.

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger,
Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate
protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.


Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns:
Fast autoregressive transformers with linear attention. In International Conference on Machine
Learning, pages 5156–5165. PMLR, 2020.

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages
4171–4186, 2019.

Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and
Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 2021.

Hyunjik Kim, George Papamakarios, and Andriy Mnih. The lipschitz constant of self-attention. In
International Conference on Machine Learning, pages 5562–5571. PMLR, 2021.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In
International Conference on Learning Representations, 2020.

Valerii Likhosherstov, Krzysztof M Choromanski, Jared Quincy Davis, Xingyou Song, and Adrian
Weller. Sub-linear memory: How to make performers slim. Advances in Neural Information
Processing Systems, 34, 2021.

Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing XU, Tao
Xiang, and Li Zhang. Soft: Softmax-free transformer with linear complexity. In M. Ranzato,
A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural
Information Processing Systems, volume 34, pages 21297–21309. Curran Associates, Inc., 2021.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong.
Random feature attention. In International Conference on Learning Representations, 2021.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International Conference on Machine Learning,
pages 8748–8763. PMLR, 2021.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse
attention with routing transformers. Transactions of the Association for Computational Linguistics,
9:53–68, 2021.

Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th
annual ACM SIGACT symposium on theory of computing, pages 1260–1268, 2018.

Aviad Rubinstein and Virginia Vassilevska Williams. Seth vs approximation. ACM SIGACT News,
50(4):57–76, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing
systems, 30, 2017.

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with
linear complexity. arXiv preprint arXiv:2006.04768, 2020.


Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical
Computer Science, 348(2):357–365, 2005.

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and
Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14138–14148,
2021.

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are
transformers universal approximators of sequence-to-sequence functions? In International
Conference on Learning Representations, 2019.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for
longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.

Appendix A. Proofs for Hardness of TVPP, BHCP and BHFP


Lemma 1. Assume SETH. Then for any ϵ > 0, the computational complexity of TVPP is Ω(n^{2−ϵ})
for d = ω(log n).
Proof Consider the two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in OVP, where
a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. For any vector a_i ∈ A, let us define ā_i ∈ R^{2d} as
the concatenation of the vector a_i and the vector (1 − a_i), where 1 := (1, 1, . . . , 1) ∈ R^d. Then we have
∥ā_i∥_1 = ∥a_i∥_1 + ∥1 − a_i∥_1 = d. Now define Ā as the set of the ā_i. For any vector b_j ∈ B, let us
define b̄_j ∈ R^{2d} as the concatenation of the vector (1 − b_j) and the vector 1. Now define B̄ as the set of the b̄_j.
Because the overall dimensions of A, B and Ā, B̄ are nd and 2nd respectively, this process takes
O(nd) time. Now for any i, j ∈ [n] we have

ā_i^T b̄_j = a_i^T (1 − b_j) + (1 − a_i)^T 1 = ∥a_i∥_1 − a_i^T b_j + ∥1 − a_i∥_1 = d − a_i^T b_j.

Hence ā_i^T b̄_j = d if and only if a_i^T b_j = 0. Now if we run the algorithm for TVPP with the threshold
t = d on the sets Ā and B̄ to find a pair ā ∈ Ā and b̄ ∈ B̄ that satisfies ā^T b̄ ≥ d, then we can
conclude that there is a pair a ∈ A and b ∈ B that satisfies a^T b = 0.
Thus, if there is an algorithm for TVPP that runs in O(n^{2−ϵ}) time, this entire process takes at
most O(nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists
at least one pair of vectors a ∈ A and b ∈ B such that a^T b = 0 in O(n^{2−ϵ}) time, which contradicts
the hardness result of OVP. As a result, for at least one value of t, TVPP cannot be solved in
O(n^{2−ϵ}) time. (We note that for t = 1, the problem has a linear-time algorithm: selecting the rows of
the matrix Q as the elements of set A, the rows of the matrix K as the elements of set B, and the rows of
the matrix V as 1, the product Y = QK^T V can be calculated in time linear in n by first computing K^T V
and then Q(K^T V). If Y has any positive entry, then we can conclude that there is a pair a ∈ A and
b ∈ B that satisfies a^T b ≥ 1.)
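A small NumPy sketch of this reduction (our own naming), mapping an OVP instance to a TVPP instance with threshold t = d:

```python
import numpy as np

def ovp_to_tvpp(A, B):
    """Map OVP sets A, B (n x d binary arrays) to TVPP sets with threshold t = d.

    a_bar = [a, 1 - a] and b_bar = [1 - b, 1], so a_bar . b_bar = d - a . b,
    and the threshold d is reached iff some original pair is orthogonal.
    """
    n, d = A.shape
    A_bar = np.hstack([A, 1 - A])
    B_bar = np.hstack([1 - B, np.ones((n, d), dtype=A.dtype)])
    return A_bar, B_bar, d

# Consistency check on a tiny instance (a_1 = (1,0,1) is orthogonal to b_1 = (0,1,0)).
A = np.array([[1, 0, 1], [1, 1, 0]])
B = np.array([[0, 1, 0], [1, 1, 1]])
A_bar, B_bar, t = ovp_to_tvpp(A, B)
assert ((A_bar @ B_bar.T) >= t).any() == ((A @ B.T) == 0).any()
```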


Lemma 2. Assume SETH. Then for any ϵ > 0, the computational complexity of BHFP is Ω(n^{2−ϵ})
for d = ω(log n).
Proof Consider the two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in OVP, where
a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n].
For any vector a_i ∈ A, let us define ā_i ∈ R^{3d} as the concatenation of the vector a_i, the vector (1 − a_i),
and the vector 0, where 1 := (1, 1, . . . , 1) ∈ R^d and 0 := (0, 0, . . . , 0) ∈ R^d. Then we have ā_i^T ā_i =
a_i^T a_i + (1 − a_i)^T (1 − a_i) = d. Now define Ā as the set of the ā_i. For any vector b_j ∈ B, let us
define b̄_j ∈ R^{3d} as the concatenation of the vector b_j, the vector 0, and the vector (1 − b_j). Then we have
b̄_j^T b̄_j = b_j^T b_j + (1 − b_j)^T (1 − b_j) = d. Now define B̄ as the set of the b̄_j. Because the overall
dimensions of A, B and Ā, B̄ are nd and 3nd respectively, this process takes O(nd) time. Now for
any i, j ∈ [n] we have

ā_i^T b̄_j = a_i^T b_j + (1 − a_i)^T 0 + 0^T (1 − b_j) = a_i^T b_j.

The squared ℓ2 distance between ā_i and b̄_j for any i, j ∈ [n] is

∥ā_i − b̄_j∥_2^2 = (ā_i − b̄_j)^T (ā_i − b̄_j) = ā_i^T ā_i + b̄_j^T b̄_j − 2 ā_i^T b̄_j = d + d − 2 ā_i^T b̄_j = 2d − 2 a_i^T b_j.

Hence ∥ā_i − b̄_j∥_2^2 = 2d if and only if a_i^T b_j = 0. Now if we run the algorithm for BHFP with the
threshold t = √(2d) on the sets Ā and B̄ to find a pair ā ∈ Ā and b̄ ∈ B̄ that satisfies ∥ā − b̄∥_2 ≥ √(2d),
then we can conclude that there is a pair a ∈ A and b ∈ B that satisfies a^T b = 0.
Thus, if there is an algorithm for BHFP that runs in O(n^{2−ϵ}) time, this entire process takes at
most O(nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists at
least one pair of vectors a ∈ A and b ∈ B such that a^T b = 0 in O(n^{2−ϵ}) time, which contradicts the
hardness result of OVP. As a result, for at least one value of t, BHFP cannot be solved in O(n^{2−ϵ})
time.

Lemma 3. Assume SETH. Then for any ϵ > 0, the computational complexity of BHCP is Ω(n^{2−ϵ})
for d = ω(log n).
Proof Consider the two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in BHFP, where
a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. Let t and t′ be the given thresholds for BHFP and
BHCP respectively.
For any vector b_j ∈ B, let us define b̄_j ∈ R^d as (1 − b_j), where 1 := (1, 1, . . . , 1) ∈ R^d. Now
define B̄ as the set of the b̄_j. Because the overall dimension of B and B̄ is nd, this process takes O(nd)
time. The squared ℓ2 distance between a_i and b̄_j for any i, j ∈ [n] is

∥a_i − b̄_j∥_2^2 = ∥a_i − (1 − b_j)∥_2^2 = d − ∥a_i − b_j∥_2^2.

Hence ∥a_i − b̄_j∥_2^2 < d − t^2 + 1 if and only if ∥a_i − b_j∥_2 ≥ t. Now if we run the algorithm for BHCP
with the threshold t′ = √(d − t^2 + 1) on the sets A and B̄ to find a pair a ∈ A and b̄ ∈ B̄ that satisfies
∥a − b̄∥_2 < √(d − t^2 + 1), then we can conclude that there is a pair a ∈ A and b ∈ B that satisfies
∥a − b∥_2 ≥ t.
Thus, if there is an algorithm for BHCP that runs in O(n^{2−ϵ}) time, this entire process takes at
most O(nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists at
least one pair of vectors a ∈ A and b ∈ B such that ∥a − b∥_2 ≥ t in O(n^{2−ϵ}) time, which contradicts
the hardness result of BHFP. As a result, for at least one value of t′, BHCP cannot be solved in
O(n^{2−ϵ}) time.

Appendix B. Proof for Hardness of Sliding Window Dot-Product Self-Attention


Theorem 6 (restated). Assume SETH. Let f : R^d × R^d → R be f(x, y) = e^{x^T y}. For a set Q of vectors
Q_1, . . . , Q_n, we define the matrix S ∈ R^{n×n} as

(S)_{ij} = f(Q_i, Q_j)  if |i − j| ≤ w/2,   and   (S)_{ij} = 0  otherwise.

Also, for any matrix M ∈ R^{n×n}, let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention.
Then for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following conditions
requires Ω((nw)^{1−ϵ}) time when d_q = ω(log nw) and w = ω(d_q).
1. Ŷ = Y (exact).
2. |Ŷ_{ij} − Y_{ij}| ≤ µ|Y_{ij}| for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ < 1 (multiplicative approximation).
3. |Ŷ_{ij} − Y_{ij}| ≤ µ for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ (additive approximation).
Proof Consider the two sets A = {a_1, . . . , a_k} and B = {b_1, . . . , b_k} given in TVPP, where
k = √(nw) and a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [k]. The problem is to decide if there
exists at least one pair a ∈ A and b ∈ B that satisfies a^T b ≥ t for a given t ∈ [d].
We construct our matrices Q and V in the following way. First, define the function (α
(mod k)) := m if α ≡ m (mod k) and m ∈ [k]. Create the matrix Q ∈ R^{3n×d} with its
even-indexed rows as Q_{2α} = C b_{(α (mod k))}, and its odd-indexed rows as Q_{2α−1} =
a_{(α (mod k)) + w⌊(2α−1)/(2k)⌋}. This process takes O(nd) time. By this construction, each (a_i, b_j)
pair is found within a distance w.
Also, select V ∈ R^{3n×1} as the alternating vector (1, 0, . . . , 1, 0).
With this we have Y = h(S) · V ∈ R^{3n×1}. Since Y is a vector, we again abuse notation and
define Y_i as the i-th element of Y.
Now consider the first n even-indexed rows of Y. Because the odd-indexed rows of the matrix Q come
from set A, and the even-indexed rows of Q come from set B, each even-indexed entry of Y is, by the
definition of the vector V, the sum of the exponentials of C times the dot products of w different
(a_i, b_j) pairs.
Moreover, since each (a_i, b_j) pair is placed within a distance w, the exponential of C times the dot
product of every pair appears in the sliding-window attention score matrix and contributes to Y_{2l} for at
least one l ∈ [n].
Now we check the magnitudes of Y_{2l} for l ∈ [n] in order to distinguish between the true and false
cases of TVPP. It takes O(n) time to check these n values. First, we focus on the exact computation.
Consider the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [k], we have
a_i^T b_j ≤ t − 1, and e^{C a_i^T b_j} ≤ e^{C(t−1)}. Then for all l ∈ [n], we have Y_{2l} ≤ w e^{C(t−1)} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [k], we
have a_i^T b_j ≥ t, and e^{C a_i^T b_j} ≥ e^{Ct}. Because every pair appears in some element of Y, we have
Y_{2l} ≥ e^{Ct} := ∆ for some l ∈ [n].
In order to distinguish between the two cases, it is sufficient to have ∆ > δ. This holds with
C = 2 log w.
Now let us consider multiplicative approximation error. With a µ-multiplicative factor, if there
are no pairs a ∈ A and b ∈ B with a^T b ≥ t, then for all l ∈ [n], Ŷ_{2l} ≤ (1 + µ)Y_{2l} ≤
(1 + µ) w e^{C(t−1)} := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for
some l ∈ [n], we have Ŷ_{2l} ≥ (1 − µ)Y_{2l} ≥ (1 − µ) e^{Ct} := ∆̂. In order to distinguish between the two
cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 log((1+µ)/(1−µ) · w).
Finally, we look at additive approximation error. With a µ-additive factor, if there are no pairs
a ∈ A and b ∈ B with a^T b ≥ t, then for all l ∈ [n], Ŷ_{2l} ≤ Y_{2l} + µ ≤ w e^{C(t−1)} + µ := δ̂.
On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for some l ∈ [n], we have
Ŷ_{2l} ≥ Y_{2l} − µ ≥ e^{Ct} − µ := ∆̂. In order to distinguish between the two cases, it is sufficient to have
∆̂ > δ̂, and this inequality holds with C = 2 log(w + 2µ).
Thus, if there is an algorithm for computing self-attention up to an element-wise multiplicative
or additive error µ that runs in O(k^{2−ϵ}) time, this entire process takes at most O(nd + k^{2−ϵ}) =
O((√(nw))^{2−ϵ}) = O((nw)^{1−ϵ}) as long as d = o(w^{1−ϵ}). Therefore, this algorithm decides if there
exists at least one pair of vectors a ∈ A and b ∈ B such that a^T b ≥ t in O((nw)^{1−ϵ}) time, which
contradicts the hardness result of TVPP (Lemma 1). This completes the proof.

A similar proof also applies to the dilated sliding window (Beltagy et al., 2020), where the self-
attention score is calculated as in Theorem 6. Also, when the self-attention score is the softmax dot
product (where the softmax is applied only over the window in each row), one can prove the Ω((nw)^{1−ϵ})
lower bound by following the proof of Theorem 5.

Appendix C. Proofs for Hardness of Self-Attention Score Matrix S Approximation


Theorem 12 Assume SETH. For any i, j ∈ [n], let S_{ij} = f(Q_i, Q_j) = e^{Q_i^T Q_j}, and for any matrix
M ∈ R^{n×n}, let h(M) = M. Suppose Ŝ ∈ R^{n×n} satisfies any of the following conditions:

1. |Ŝ_{ij} − S_{ij}| ≤ µ|S_{ij}| for all i, j ∈ [n], where 0 ≤ µ < 1 (multiplicative approximation).

2. |Ŝ_{ij} − S_{ij}| ≤ µ for all i, j ∈ [n], where 0 ≤ µ (additive approximation).

Let Y = h(Ŝ) · V ∈ R^{n×d_v} be a self-attention. Then for any ϵ > 0, computing the self-attention
Y ∈ R^{n×d_v} requires Ω(n^{2−ϵ}) time when d_q = ω(log n).

Proof Consider two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in TVPP, where a_i, b_i ∈
{0, 1}^d are binary vectors for all i ∈ [n]. The problem is to decide if there exists at least one pair a ∈ A
and b ∈ B that satisfies a^T b ≥ t for a given t ∈ [d].
Suppose that the matrices Q and V are constructed according to the TVPP vector gadgets described
in Section 4.2. With this we have Y = h(Ŝ) · V ∈ R^{n×1}. Since Y is a vector, we again abuse
notation and define Y_i as the i-th element of Y. Now consider the first n elements of Y. Since
Y = ŜV, we have that

Y_i = Σ_{j=1}^{n} Ŝ_{ij}   for any i ∈ [n].


Now we check the magnitude of Y_i, i ∈ [n], in order to distinguish between the true and false cases of TVPP. It takes O(n) time to check each Y_i, i ∈ [n]. First we focus on the multiplicative error approximation. Consider the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t, that is, for all i, j ∈ [n] we have a_i^T b_j ≤ t − 1 and S_ij = e^{C a_i^T b_j} ≤ e^{C(t−1)}, so Ŝ_ij ≤ (1 + µ)S_ij = (1 + µ)e^{C a_i^T b_j} ≤ (1 + µ)e^{C(t−1)}. Then for all l ∈ [n], Y_l ≤ n(1 + µ)e^{C(t−1)} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t, that is, for some i, j ∈ [n] we have a_i^T b_j ≥ t and S_ij = e^{C a_i^T b_j} ≥ e^{Ct}, so Ŝ_ij ≥ (1 − µ)S_ij = (1 − µ)e^{C a_i^T b_j} ≥ (1 − µ)e^{Ct}. Since S_ij ≥ 1 for every entry, for some l ∈ [n] we have Y_l ≥ (1 − µ)(e^{Ct} + n − 1) := ∆.


In order to distinguish between two cases, it is sufficient to have ∆ > δ. This holds with C = 2 log( (1+µ)/(1−µ) · n ).
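To verify this choice of C (a worked check using only the bounds derived above): since ∆ ≥ (1 − µ)e^{Ct} and δ = n(1 + µ)e^{C(t−1)}, it suffices that
\[
(1-\mu)\,e^{Ct} > n(1+\mu)\,e^{C(t-1)}
\;\Longleftrightarrow\;
e^{C} > \frac{1+\mu}{1-\mu}\,n,
\]
which holds for C = 2 log( (1+µ)/(1−µ) · n ), since then e^{C} = ((1+µ)n/(1−µ))^{2} > (1+µ)n/(1−µ) (the square exceeds the base whenever the base exceeds 1, e.g. for n ≥ 2).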
Now let us look at the additive error approximation. With a µ-additive factor, if there are no pairs a ∈ A and b ∈ B with a^T b ≥ t, then we have for all l ∈ [n], Y_l ≤ n e^{C(t−1)} + nµ := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for some l ∈ [n], we have Y_l ≥ e^{Ct} + n − 1 − nµ := ∆̂. In order to distinguish between two cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 log(n + 2µ).
Thus, if there is an algorithm for computing self-attention up to an element-wise multiplicative or additive error µ that runs in O(n^{2−ϵ}) time, this entire process takes at most O(n^{2−ϵ}), and this algorithm decides if there exists at least a pair of vectors a ∈ A and b ∈ B such that a^T b ≥ t in O(n^{2−ϵ}) time, which contradicts the hardness result of TVPP (Lemma 1). This completes the proof.
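The O(n) decision step used at the end of this (and the other) reductions can be sketched as follows; decide_from_row_sums is an illustrative name of our own, and the row sums are assumed to come from the hypothetical sub-quadratic attention algorithm.

```python
import numpy as np

def decide_from_row_sums(Y_rows, delta, Delta):
    """Threshold test from the proofs: in the 'no pair' case every row sum is
    at most delta, and in the 'pair exists' case some row sum is at least
    Delta > delta, so a single O(n) scan distinguishes the two cases."""
    assert Delta > delta
    return bool(np.max(Y_rows) >= Delta)   # True iff some pair satisfies a^T b >= t
```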

Similar proofs also show the quadratic complexity of multiplicative approximation of the score matrix S for softmax dot-product self-attention and for ℓ2-self-attention, following directly from Theorem 5 and Theorem 7. Likewise, by Theorem 6, one can show the quadratic complexity of additive and multiplicative approximation of the score matrix S for sliding-window dot-product self-attention.

Appendix D. Proofs of Lemmas for Polynomial Approximations of Self-Attention


Lemma 4. Let p be an integer ≥ 0 and let S_ij be C · (Q_i^T K_j)^p, where i, j ∈ [n] and C is a constant. Then SV can be computed in O(n d_q^p d_v) time.
Proof We omit the constant C in this proof since we can multiply any of the matrices Q, K, V by C in O(n d_q) or O(n d_v) time before the rest of the computation.
When p = 1, SV = QK^T V, thus SV can be trivially computed by first computing K^T V (multiplying K^T and V in O(d_q n d_v) time), then multiplying Q and K^T V in O(n d_q d_v) time.
For p ≥ 2, consider a fixed (i, j) pair. Now

S_ij = (Q_i^T K_j)^p = ( Σ_{r=1}^{d_q} x_r y_r )^p = (x_1 y_1 + · · · + x_{d_q} y_{d_q})^p
     = Σ_{r_1,...,r_p=1}^{d_q} x_{r_1} y_{r_1} · · · x_{r_p} y_{r_p}
     = Σ_{r_1,...,r_p=1}^{d_q} (x_{r_1} · · · x_{r_p})(y_{r_1} · · · y_{r_p})          (3)

where x = Q_i, y = K_j, and x_r, y_r denote the r-th elements of the corresponding vectors. Given a vector v ∈ R^{d_q}, let us define a function α : R^{d_q} → R^{d_q^p} where the elements of α(v) are computed


by the element-wise multiplications v_{r_1} · · · v_{r_p}, r_1, . . . , r_p ∈ [d_q] (ordered p-tuples of indices from [d_q], i.e., p-permutations with replacement). With this, let us define Q̂ ∈ R^{n×d_q^p} (and similarly K̂) where each row i is computed as α(Q_i) (respectively α(K_i) for K̂). The matrices Q̂, K̂ can be computed in O(n d_q^p) time. From Eq. (3), it is evident that (Q_i^T K_j)^p = Q̂_i^T K̂_j. Now, similar to the case p = 1, one can compute SV by first computing K̂^T V (multiplying K̂^T and V in O(d_q^p n d_v) time), then multiplying Q̂ and K̂^T V in O(n d_q^p d_v) time. This completes the proof.
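The construction above can be sketched in a few lines of NumPy (an illustrative sketch under the notation of the lemma, not part of the original algorithm; the helper names poly_features and poly_attention are ours). The right-associated product Q̂(K̂^T V) is what avoids ever forming the n × n matrix S.

```python
import numpy as np
from itertools import product

def poly_features(X, p):
    """Map each row v of X (n x d_q) to the d_q**p vector of all ordered
    products v[r1]*...*v[rp] -- a sketch of the map alpha in Lemma 4."""
    n, d = X.shape
    feats = np.empty((n, d ** p))
    for idx, combo in enumerate(product(range(d), repeat=p)):
        feats[:, idx] = np.prod(X[:, combo], axis=1)
    return feats

def poly_attention(Q, K, V, p, C=1.0):
    """Compute S V with S_ij = C * (Q_i^T K_j)^p without forming the n x n
    matrix S: cost O(n * d_q^p * d_v) instead of O(n^2 * d_v)."""
    Qh = poly_features(Q, p)          # n x d_q^p
    Kh = poly_features(K, p)          # n x d_q^p
    return C * (Qh @ (Kh.T @ V))      # associate to the right: (d_q^p x d_v) first

# sanity check against the quadratic-time computation
rng = np.random.default_rng(0)
n, dq, dv, p = 50, 4, 3, 2
Q, K, V = rng.normal(size=(n, dq)), rng.normal(size=(n, dq)), rng.normal(size=(n, dv))
S = (Q @ K.T) ** p
assert np.allclose(S @ V, poly_attention(Q, K, V, p))
```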

Lemma 5. Let p be an integer ≥ 0. Then the sums Σ_{ĵ=1}^{n} C · (Q_i^T K_ĵ)^p for all i ∈ [n], where C is a constant, can be computed in O(n d_q^p) time.


Proof We omit the constant C in this proof since we can multiply any of the matrices Q, K by C in O(n d_q) time before the rest of the computation.
Let Q̂, K̂ ∈ R^{n×d_q^p} be the matrices computed by applying α(·) to the rows of Q, K as in the proof of Lemma 4 (see Eq. (3)). The time complexity of computing Q̂, K̂ is O(n d_q^p).
Now we have Σ_{ĵ=1}^{n} (Q_i^T K_ĵ)^p = Σ_{ĵ=1}^{n} Q̂_i^T K̂_ĵ = Q̂_i^T Σ_{ĵ=1}^{n} K̂_ĵ. Let A = Σ_{ĵ=1}^{n} K̂_ĵ; A can be computed in O(n d_q^p) time (taking the summation of n vectors of size d_q^p). We store the value of A in memory and reuse it. One can then compute the inner product of Q̂_i and A in O(d_q^p) time per i. For all i ∈ [n], this takes O(n d_q^p) time, which gives the desired time complexity.
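Continuing the sketch above (same assumptions; poly_features is the helper defined there), the row sums of Lemma 5 can be computed as follows.

```python
import numpy as np

def poly_row_sums(Q, K, p, C=1.0):
    """Sum_j C*(Q_i^T K_j)^p for every i in O(n * d_q^p) total time (Lemma 5):
    precompute A = sum of the rows of K-hat once, then one inner product per i.
    Reuses poly_features from the previous sketch."""
    Qh = poly_features(Q, p)                 # n x d_q^p
    A = poly_features(K, p).sum(axis=0)      # d_q^p vector, computed once
    return C * (Qh @ A)

# sanity check against the quadratic-time computation
rng = np.random.default_rng(1)
Q, K = rng.normal(size=(40, 3)), rng.normal(size=(40, 3))
assert np.allclose(poly_row_sums(Q, K, p=2), ((Q @ K.T) ** 2).sum(axis=1))
```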

Appendix E. Proof for Hardness of ℓ2 -Self-Attention


Theorem 4. Assume SETH. For any i, j ∈ [n], let S_ij = f(Q_i, Q_j) = e^{−C·∥Q_i − Q_j∥₂²}, where C is a parameter, and for any matrix M ∈ R^{n×n}, let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention. Then for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following conditions requires Ω(n^{2−ϵ}) time when d_q = ω(log n).
1. Ŷ = Y (exact).
2. |Ŷ_ij − Y_ij| ≤ µ|Y_ij| for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ < 1 (multiplicative approximation).

Proof Consider two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in BHCP, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The problem is to decide if there exists at least a pair a ∈ A and b ∈ B that satisfies ∥a − b∥₂² < t for a given t ∈ [n].
Suppose that the matrices Q and V are constructed according to the BHCP vector gadgets described in Section 4.2. With this we have Y = h(S) · V ∈ R^{n×1}. Since Y is a vector, we abuse the notation again and define Y_i as the i-th element of Y. Now consider the first n elements of Y. Since Y = SV, we have that

Y_i = Σ_{j=1}^{n} e^{−C∥a_i − b_j∥₂²}   for any i ∈ [n].

Now we check the magnitude of Y_i, i ∈ [n], in order to distinguish between the true and false cases of BHCP. It takes O(n) time to check each Y_i, i ∈ [n]. First we focus on the exact computation. Consider the following two cases.


Case 1. There are no pairs a ∈ A and b ∈ B with ∥a − b∥₂² < t, that is, for all i, j ∈ [n] we have ∥a_i − b_j∥₂² ≥ t and e^{−C∥a_i − b_j∥₂²} ≤ e^{−Ct}. Then for all l ∈ [n], Y_l ≤ n e^{−Ct} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with ∥a − b∥₂² < t, that is, for some i, j ∈ [n] we have ∥a_i − b_j∥₂² ≤ t − 1 and e^{−C∥a_i − b_j∥₂²} ≥ e^{−C(t−1)}. Thus for some l ∈ [n], we have Y_l ≥ e^{−C(t−1)} := ∆.
In order to distinguish between two cases, it is sufficient to have ∆ > δ. This holds with
C = 2 log n.
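A quick worked check of this constant, using only the definitions of δ and ∆ above:
\[
\Delta > \delta \;\Longleftrightarrow\; e^{-C(t-1)} > n\,e^{-Ct} \;\Longleftrightarrow\; e^{C} > n,
\]
and C = 2 log n gives e^{C} = n^{2} > n (for n ≥ 2).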
Now let us look at the multiplicative error approximation. With a µ-multiplicative factor, if there are no pairs a ∈ A and b ∈ B with ∥a − b∥₂² < t, then we have for all l ∈ [n], Ŷ_l ≤ (1 + µ)Y_l ≤ (1 + µ)n e^{−Ct} := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with ∥a − b∥₂² < t, then for some l ∈ [n], we have Ŷ_l ≥ (1 − µ)Y_l ≥ (1 − µ)e^{−C(t−1)} := ∆̂. In order to distinguish between two cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 log( (1+µ)/(1−µ) · n ).
Thus, if there is an algorithm for computing self-attention up to an element-wise multiplicative error µ that runs in O(n^{2−ϵ}) time, this entire process takes at most O(n + nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists at least a pair of vectors a ∈ A and b ∈ B such that ∥a − b∥₂² < t in O(n^{2−ϵ}) time, which contradicts the hardness result of BHCP (Lemma 3). This completes the proof.

Here in the proof we set C = 2 log( 2(1+µ)/(1−µ) · n ); however, the parameter C in the RBF kernel is predefined, therefore the hardness result is valid only conditioned on this specific C. We can bypass this by simply defining C = log( 2(1+µ)/(1−µ) · n ) = αβ, then setting α to the predefined constant of the RBF function and solving this equation for β. After this, all that remains is modifying the BHCP vector gadget (see Section 4.2) by multiplying each vector Q_i, i ∈ [2n], by the scalar √β in O(n) time, and we obtain the desired hardness result for the general RBF kernel.
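A minimal sketch of this rescaling step (illustrative only; rescale_bhcp_gadgets is our name, gadgets stands for the 2n stacked BHCP gadget vectors of Section 4.2, which are not reproduced here, and the expression for C follows the discussion above):

```python
import numpy as np

def rescale_bhcp_gadgets(gadgets, alpha, n, mu):
    """Fold the constant C required by the reduction into the gadget vectors
    when the RBF parameter alpha is fixed in advance: write C = alpha * beta
    and scale every gadget vector by sqrt(beta), since
    exp(-alpha * ||sqrt(beta)*(q_i - q_j)||^2) = exp(-C * ||q_i - q_j||^2)."""
    C = np.log(2.0 * (1.0 + mu) / (1.0 - mu) * n)
    beta = C / alpha
    return np.sqrt(beta) * gadgets   # O(n) work: one rescaling per gadget vector
```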

Appendix F. Multi-Head Self-Attention


Lemma 6. k parallel OVP (or TVPP, or BHCP, or BHFP) problems (each with two sets of n binary vectors) cannot be solved in O(kn^{2−ϵ}) time for any ϵ > 0 (assuming SETH).
Proof Suppose there is an algorithm for k parallel OVP (TVPP, BHCP, BHFP) problems that runs in O(kn^{2−ϵ}) time for some ϵ > 0.
Say A_i and B_i are the sets of n binary vectors for each i ∈ [k].
For each t ∈ [k], consider the following k parallel OVP (TVPP, BHCP, BHFP) problems:

(A_1, B_t), (A_2, B_{t+1}), · · · , (A_k, B_{t+k−1}),   where B_{l+k} = B_l.

By the assumption, each such batch can be solved in O(kn^{2−ϵ}) time. Running over all t ∈ [k] covers every pair (A_i, B_j), so there is an O(k × kn^{2−ϵ})-time algorithm that solves the OVP problem on (A = A_1 ∪ · · · ∪ A_k, B = B_1 ∪ · · · ∪ B_k). In other words, the OVP (TVPP, BHCP, BHFP) problem with sets of nk binary vectors has an O((kn)^{2−δ})-time algorithm (by selecting δ = ϵ/2 and k < n). This contradicts SETH. As a result, k parallel OVP (TVPP, BHCP, BHFP) problems cannot be solved in O(kn^{2−ϵ}) time for any ϵ > 0.

This lemma shows that a direct-sum property holds for the computational complexity of the OVP (TVPP, BHCP, BHFP) problems.
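The covering argument in the proof can be checked with a small sketch (the helper name rotated_batches and the 1-indexed convention are ours):

```python
from itertools import product

def rotated_batches(k):
    """Batch t pairs A_i with B_{((i + t - 2) mod k) + 1}, i.e. the k parallel
    problems (A_1, B_t), (A_2, B_{t+1}), ..., (A_k, B_{t+k-1}) with wraparound.
    Together the k batches cover every (A_i, B_j) pair exactly once."""
    return [[(i, (i + t - 2) % k + 1) for i in range(1, k + 1)] for t in range(1, k + 1)]

# every (i, j) pair is covered exactly once across the k batches
k = 4
covered = [pair for batch in rotated_batches(k) for pair in batch]
assert sorted(covered) == sorted(product(range(1, k + 1), repeat=2))
```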


Appendix G. Discussion of Additive Approximation


This part relies on the fact that, for a given value of C and the corresponding choices of δ and ∆, the difference ∆ − δ is positive; the size of this gap determines how large an additive error µ can be tolerated.
The following theorem shows that we cannot reach, in subquadratic time, an element-wise additive approximation better than e^{−2d log(n+2)} = (n + 2)^{−2d} for ℓ2-distance self-attention. So this element-wise additive approximation error is of order O(n^{−d}).
Theorem 5. Assume SETH. For any i, j ∈ [n], let S_ij = f(Q_i, Q_j) = e^{−C∥Q_i − Q_j∥₂²}, where C is a parameter, and for any matrix M ∈ R^{n×n} let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention. Then for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies the following condition requires Ω(n^{2−ϵ}) time when d_q = ω(log n): |Ŷ_ij − Y_ij| ≤ µ for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ ≤ e^{−2d log(n+2)} = (n + 2)^{−2d} (additive approximation).
Proof Consider two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in BHCP, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The problem is to decide if there exists at least a pair a ∈ A and b ∈ B that satisfies ∥a − b∥₂² < t for a given t ∈ [n].
Suppose that the matrices Q and V are constructed according to the BHCP vector gadgets described in Section 4.2. With this we have Y = h(S) · V ∈ R^{n×1}. Since Y is a vector, we abuse the notation again and define Y_i as the i-th element of Y. Now consider the first n elements of Y. Since Y = SV, we have that Y_i = Σ_{j=1}^{n} e^{−C∥a_i − b_j∥₂²} for any i ∈ [n].
Case 1. There are no pairs a ∈ A and b ∈ B that satisfy ∥a − b∥₂² < t, that is, for all i, j ∈ [n] we have ∥a_i − b_j∥₂² ≥ t and e^{−C∥a_i − b_j∥₂²} ≤ e^{−Ct}. Then for all l ∈ [n], Y_l ≤ n e^{−Ct}. It is given that |Ŷ_l − Y_l| ≤ µ, so Ŷ_l ≤ Y_l + µ ≤ n e^{−Ct} + µ ≤ n e^{−Ct} + e^{−2d log(n+2)} =: δ.
Case 2. There is a pair a ∈ A and b ∈ B with ∥a − b∥₂² < t, that is, for some i, j ∈ [n] we have ∥a_i − b_j∥₂² ≤ t − 1 and e^{−C∥a_i − b_j∥₂²} ≥ e^{−C(t−1)}. Thus for some l ∈ [n], Y_l ≥ e^{−C(t−1)}. It is given that |Ŷ_l − Y_l| ≤ µ, so Ŷ_l ≥ Y_l − µ ≥ e^{−C(t−1)} − µ ≥ e^{−C(t−1)} − e^{−2d log(n+2)} =: ∆.
In order to distinguish between two cases, it is sufficient to have ∆ > δ. This holds with
C = 2 log(n + 2).
Thus, if there is an algorithm for computing self-attention up to an element-wise additive error µ ≤ e^{−2d log(n+2)} that runs in O(n^{2−ϵ}) time, this entire process takes at most O(n + nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists at least a pair of vectors a ∈ A and b ∈ B with ∥a − b∥₂² < t in O(n^{2−ϵ}) time, which contradicts the hardness result of BHCP (Lemma 3). This completes the proof.

The following theorem shows that we cannot reach, in subquadratic time, an element-wise additive approximation better than e^{−3d log(n) − 3d²} = n^{−3d} · e^{−3d²} for the softmax dot-product self-attention. This element-wise additive approximation error is of order O(n^{−d}).
Theorem 6. Assume SETH. For any i, j ∈ [n], let S_ij = f(Q_i, Q_j) = e^{Q_i^T Q_j}, and for any matrix M ∈ R^{n×n}, let h(M) ∈ R^{n×n} be the matrix where for all i, j ∈ [n], {h(M)}_ij = M_ij / Σ_{k=1}^{n} M_ik. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention. Then provided d_q = ω(log n), for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies the following condition requires Ω(n^{2−ϵ}) time:
|Ŷ_ij − Y_ij| ≤ µ for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ ≤ e^{−3d log(n) − 3d²} (additive approximation).


Proof The proof is technically similar. Suppose that the matrices Q and V are constructed according to the TVPP vector gadgets described in Section 4.2. With this we have Y = h(S) · V ∈ R^{n×1}. Consider the first n elements of Y. Since h(·) acts as the row-wise softmax function, we have

Y_i = Σ_{j=n+1}^{2n} {h(S)}_ij = Σ_{j=1}^{n} [ e^{C a_i^T b_j} / ( Σ_{k=1}^{n} e^{C a_i^T a_k} + Σ_{k=1}^{n} e^{C a_i^T b_k} ) ]
    = ( Σ_{j=1}^{n} e^{C a_i^T b_j} ) / ( Σ_{k=1}^{n} e^{C a_i^T a_k} + Σ_{k=1}^{n} e^{C a_i^T b_k} ).

Case 1. There are no pairs a ∈ A and b ∈ B that satisfy a^T b ≥ t, that is, for all i, j ∈ [n] we have a_i^T b_j ≤ t − 1, and Σ_{j=1}^{n} e^{C a_i^T b_j} ≤ n e^{C(t−1)}. For a function x/(x+y), the maximum value is achieved at the maximum x and minimum y values. Thus, for all l ∈ [n], Y_l ≤ n e^{C(t−1)} / (n e^{C(t−1)} + n) = e^{C(t−1)} / (e^{C(t−1)} + 1). It is given that |Ŷ_l − Y_l| ≤ µ, so Ŷ_l ≤ Y_l + µ ≤ e^{C(t−1)}/(e^{C(t−1)} + 1) + µ ≤ e^{C(t−1)}/(e^{C(t−1)} + 1) + e^{−3d log(n) − 3d²} =: δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t, that is, for some i, j ∈ [n] we have a_i^T b_j ≥ t. Then the row sum corresponding to that i, j pair is Σ_{j=1}^{n} e^{C a_i^T b_j} ≥ e^{Ct} + (n − 1)e^0 = e^{Ct} + n − 1. For a function x/(x+y), the minimum value is achieved at the minimum x and maximum y values. Thus, for some l ∈ [n], we have Y_l ≥ (e^{Ct} + n − 1) / (e^{Ct} + n − 1 + n e^{d}). It is given that |Ŷ_l − Y_l| ≤ µ, so Ŷ_l ≥ Y_l − µ ≥ (e^{Ct} + n − 1)/(e^{Ct} + n − 1 + n e^{d}) − µ ≥ (e^{Ct} + n − 1)/(e^{Ct} + n − 1 + n e^{d}) − e^{−3d log(n) − 3d²} =: ∆.
In order to distinguish between two cases, it is sufficient to have ∆ > δ; the remainder of the argument (choosing C and concluding the reduction) proceeds as in the proof of Theorem 5.

Appendix H. Discussion of Remark in Section 4.4


In our results, we show the quadratic hardness of multiplicative-error approximation of the self-attention matrix elements for the dot-product softmax self-attention mechanism. One assumption we make on the approximation factor µ is that µ < 1. Consider the value of the parameter C = log( 2(1+µ)/(1−µ) · n ) + d in the proof of Theorem 5. Notice that when µ = 1, C is not defined. In fact, having µ = 1 allows |Ŷ_l − Y_l| = |Y_l| for all l ∈ [n], therefore one can approximate every entry of Y by 0 in O(n) time while satisfying this condition. However, as mentioned in the remark, one can set µ close to 1 by setting it as 1 − 1/n^x for a constant x, while keeping C of the same order.
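A short worked check of the last claim (our own calculation, using the value of C quoted above):
\[
\mu = 1 - n^{-x} \;\Longrightarrow\;
C = \log\!\Big(\tfrac{2(1+\mu)}{1-\mu}\,n\Big) + d
  = \log\!\big(2\,(2 - n^{-x})\,n^{x+1}\big) + d
  \;\le\; \log 4 + (x+1)\log n + d,
\]
so C remains of order O(d + log n) for constant x, the same order as when µ is bounded away from 1.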
