Abstract
Transformer architectures have led to remarkable progress in many state-of-the-art applications. However,
despite their successes, modern transformers rely on the self-attention mechanism, whose time- and
space-complexity is quadratic in the length of the input. Several approaches have been proposed
to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large
majority of these works are not accompanied by rigorous error guarantees. In this work, we establish
lower bounds on the computational complexity of self-attention in a number of scenarios. We prove
that the time complexity of self-attention is necessarily quadratic in the input length, unless the
Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention
computation is performed only approximately, and for a variety of attention mechanisms. As a
complement to our lower bounds, we show that it is indeed possible to approximate dot-product
self-attention using finite Taylor series in linear time, at the cost of an exponential dependence on the polynomial order.
1. Introduction
Motivation. Building upon early successes in natural language processing (Vaswani et al., 2017;
Kenton and Toutanova, 2019), transformer models now form the core of virtually every state-of-
the-art approach in numerous applications: computer vision and image understanding (Dosovitskiy
et al., 2021; Khan et al., 2021), proteomics (Jumper et al., 2021), code synthesis (Chen et al., 2021a),
and vision-language models (Radford et al., 2021; Alayrac et al., 2022). At the heart of transformer
architectures is the self-attention mechanism, which can be viewed as a trainable “layer” that takes in
as input a set of tokens (vectors), computes pairwise dot-products between (projected forms of) the
tokens, performs a softmax operation to obtain non-negative weights, and produces output tokens
using these weights.
Unfortunately, by virtue of its very definition, the standard form of self-attention requires pairwise
token operations, and therefore incurs quadratic running time in terms of the number of input tokens.
This poses serious computational challenges not just in terms of the training cost of such models,
but even just for inference (forward passes) through transformer models. The community has long
acknowledged this fact, and numerous approximate forms of self-attention that break this quadratic
bottleneck have been proposed. Some approaches advocate reducing complexity through windowing,
striding, or some other sparsification of the attention scores (Kitaev et al., 2020; Zaheer et al., 2020).
Others leverage ideas from hashing (Kitaev et al., 2020; Choromanski et al., 2021). Yet others
propose kernelizing the attention operation (Katharopoulos et al., 2020; Lu et al., 2021; Chen et al.,
2021b).
Empirically, all these methods certainly seem to lead to reduced running times: in some cases,
they reduce the costs from quadratic to linear. However, all of these methods incur some form of
error in the computation (compared to vanilla attention). These errors may have the undesirable (but
benign) effect of a drop in accuracy, or may have more dramatic effects when fed an adversarial input.
In any case, it would be useful to clearly establish rigorous guarantees on time/accuracy tradeoffs for
the above methods, but such guarantees are rare. The primary theoretical question that we ask is as follows: can self-attention be computed in provably sub-quadratic time while remaining provably near-accurate (in an additive or multiplicative sense) for all inputs?
Our contributions. In this paper, we pursue a complementary path from most previously published
work in this area. Somewhat surprisingly, we are able to establish (conditional) quadratic lower
bounds on the running time of self-attention in a large variety of settings. This quadratic barrier holds
even if we relax the self-attention operation and allow for additive, or multiplicative, errors in the
computation. We also prove quadratic — specifically, rectangular — lower bounds even if we allow
for windowing or striding. Finally, this holds even if we use kernelization (with radial basis function
kernels), which to our current knowledge, achieves the Pareto time/accuracy frontier in terms of
empirical performance among all fast algorithms for self-attention computation (Lu et al., 2021). In Table 1, we summarize our hardness results; checkmarks indicate the complexities proved in this paper for different types of self-attention and computation models.
Our results suggest that there may be a fundamental “no free lunch” phenomenon at play here: it seems unlikely that we can obtain (provably) sub-quadratic algorithms for self-attention that are also (provably) near-accurate for all inputs.
Finally, while our primary contributions in this paper are mainly from the perspective of lower
bounds, we also provide some upper bounds. Specifically, we show that a finite Taylor series
approximation of the softmax function lends itself to an approximate form of self-attention that can
be computed in linear time. However, a caveat of this result is that the running time now scales
exponentially in the order of the Taylor polynomial.
Techniques. Our proofs are rather intuitive and are based on careful reductions from the Strong
Exponential Time Hypothesis (SETH). SETH-based lower bounds have attracted recent (and growing) attention from the complexity theory community, and have been used to prove hardness results for edit distance (Backurs and Indyk, 2015), Frechet distance (Bringmann, 2014), dynamic time warping (Bringmann and Künnemann, 2015), among many others (Bringmann, 2021). For machine
learning problems, such lower bounds are less common; still, quadratic-time barriers based on SETH
have been proved for kernel PCA and backpropagation through dense networks (Backurs et al., 2017),
as well as nearest neighbors (Rubinstein, 2018). Our results can be viewed as an addition to this
body of work.
A direct reduction from SETH is cumbersome. So instead, mirroring (Backurs and Indyk, 2015),
we derive reductions from the Orthogonal Vectors Problem (OVP), which is known to require almost-
quadratic time assuming SETH. As intermediate waypoints we visit two adaptations of OVP: the
Threshold Vectors Product Problem (TVPP), and the Bichromatic Hamming Close Pair (BHCP) problem. Reductions between these problems require the construction of several “vector gadgets”, and
form the bulk of the technical difficulty of our proofs. Another subtlety lies in identifying the correct
temperature scaling in the softmax in order to achieve the reductions. See Section 4 for details.
2. Related work
Attention mechanisms and transformers. Ever since the seminal work of (Vaswani et al., 2017),
transformer architectures (with self-attention layers as their primary building blocks) have become
the cornerstone of state-of-the-art machine learning models. Therefore, a firm theoretical under-
standing about the statistical and computational tradeoffs involved in transformer-based models is of
considerable interest. Transformers have already been shown to exhibit the universal approximation
property (Yun et al., 2019), but lose expressivity unless the self-attention mechanism is accompa-
nied with skip connections (Dong et al., 2021). Self-attention also exhibits undesirable Lipschitz
continuity properties, but this can be fixed by pursuing kernel-like alternatives (Kim et al., 2021).
Speeding up self-attention. Our focus in this paper is on the running time of self-attention
computation. It has been well-established that the (standard) definition of self-attention takes in as
input a length-n sequence of tokens of size d, and requires O(dn2 ) time to compute the output. The
quadratic dependence on n poses a challenge for very long input sequences, both from the training and
testing perspectives. Therefore, several sub-quadratic methods for evaluating attention layers have
been proposed. Approaches (such as the Reformer (Kitaev et al., 2020), Big Bird (Zaheer et al., 2020),
Linformer (Wang et al., 2020), Longformer (Beltagy et al., 2020), or routing transformers (Roy et al.,
2021)) use some combination of hashing, sparsification, or low-rank approximation to speed up the
computation of the attention scores. Other approaches involve replacing the softmax-based attention
with kernel approximations; cf. the work of (Katharopoulos et al., 2020), or the Nyströmformer
(Xiong et al., 2021). More recent works such as the Performer (Choromanski et al., 2021), Slim
(Likhosherstov et al., 2021), or RFA (Peng et al., 2021) approximate the attention computation using
random projections. Methods such as SOFT (Lu et al., 2021) or the Skyformer (Chen et al., 2021b)
propose to replace softmax operations with Gaussian kernels that can then be quickly evaluated.
Despite the large number (and diversity) of interesting algorithmic ideas involved in the above
efforts, the vast majority of these works only focus on improvement in running time; but few (if any)
theoretically characterize the error incurred by their proposed methods. Can there ever be a method
that is both fast (i.e., provably with sub-quadratic running time) as well as near-accurate (i.e., with
provably small additive or multiplicative error)? Our results show that this is unlikely to be the case.
Fine-grained complexity. Classical complexity theory has primarily focused on distinguishing problems that admit efficient (polynomial-time) solutions from those that do not. However,
a different (and finer) picture has begun to emerge over the last decade. In particular, the focus
has shifted towards precisely pinning down the exponent, c, of a problem that can be solved in
polynomial time Õ(nc ). Many of these results are conditional, and rely on reductions from popular
(but plausible) conjectures such as the Strong Exponential Time Hypothesis (SETH) (Impagliazzo
and Paturi, 2001; Impagliazzo et al., 2001). See the relevant surveys (Indyk, 2017; Rubinstein and Williams, 2019; Bringmann, 2021) for a comprehensive overview of the field. In particular,
this approach has been shown to provide conditional lower bounds on well-known problems such as
edit distance (Backurs and Indyk, 2015), Frechet distance (Bringmann, 2014), dynamic time warping
(Bringmann and Künnemann, 2015), longest common subsequence (LCS) (Abboud et al., 2015),
Hausdorff distance (Bringmann and Nusser, 2021), and string matching (Abboud et al., 2018). In the
context of machine learning and massive data analysis, reductions from SETH have been fruitfully
applied to problems such as clustering (Abboud et al., 2019), kernel PCA (Backurs et al., 2017), and
approximate nearest neighbors (Rubinstein, 2018).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V. \qquad (1)$$
Let us consider the case where n is very large compared to d_k, d_v. By the very virtue of its definition, we might expect to incur O(n^2) time to compute this self-attention operation: 1) calculation of S = QK^T/√d_k takes O(n^2 d_k), 2) exponentiation and calculation of the row sums of S takes O(n^2) time, 3) division of each element of S by the corresponding row sum takes O(n^2), and 4) multiplication of softmax(QK^T) and V takes O(n^2 d_v) time. Therefore, the computational complexity of this naive approach to computing self-attention scales quadratically in n.
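To make the cost breakdown above concrete, the following is a minimal NumPy sketch (ours, not code from the paper) of the naive evaluation of Eq. 1; the variable names follow the notation above, and the max-subtraction is only a standard numerical-stability device.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Naive O(n^2) evaluation of Eq. 1: softmax(Q K^T / sqrt(d_k)) V."""
    n, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)                     # step 1: n x n scores, O(n^2 d_k)
    E = np.exp(S - S.max(axis=1, keepdims=True))   # step 2: exponentiate (stabilized), O(n^2)
    P = E / E.sum(axis=1, keepdims=True)           # step 3: divide by row sums, O(n^2)
    return P @ V                                   # step 4: multiply by V, O(n^2 d_v)

# Example usage with random inputs.
rng = np.random.default_rng(0)
n, d_k, d_v = 128, 16, 16
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
Y = naive_attention(Q, K, V)   # Y has shape (n, d_v)
```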
A generalized form of self-attention. While Eq. 1 is the typical way to define self-attention, we
also consider a more general form. Let f : Rd × Rd → R be a function that takes two vectors in
Rd as input. Then, the self-attention score matrix S is defined as Sij = f (Qi , Kj ) for all i, j ∈ [n].
Also, let h : R^{n×n} → R^{n×n} be a normalization function. Then a more abstract definition of self-attention can be expressed as:

$$Y = h(S)\,V, \quad \text{where } S_{ij} = f(Q_i, K_j). \qquad (2)$$
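To fix ideas, here is a naive reference sketch of this generalized form as we read it (the function names are ours and purely illustrative); the choices of f and h at the end recover unscaled dot-product softmax attention.

```python
import numpy as np

def generalized_attention(Q, K, V, f, h):
    """Generalized self-attention of Eq. 2: S_ij = f(Q_i, K_j), Y = h(S) V.
    Naive O(n^2) reference implementation."""
    n = Q.shape[0]
    S = np.array([[f(Q[i], K[j]) for j in range(n)] for i in range(n)])
    return h(S) @ V

def dot_exp(q, k):
    """Score function f for (unscaled) dot-product attention."""
    return np.exp(q @ k)

def row_normalize(S):
    """Normalization h that divides each row by its sum (row-wise softmax given dot_exp)."""
    return S / S.sum(axis=1, keepdims=True)
```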
Definition 1 (Orthogonal Vectors Problem (OVP)) Two sets with equal cardinality A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} are given, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The problem decides if there exists at least one pair of vectors a ∈ A and b ∈ B such that a^T b = 0.
Previous work has shown that for all ϵ > 0, there is no O(n^{2−ϵ})-time algorithm that solves OVP for d = ω(log n) unless SETH is false (Williams, 2005). In other words, for any ϵ > 0, the problem cannot be solved in O(n^{2−ϵ}) time when d = ω(log n). In this work, we will primarily give lower bounds on the computational complexity of the self-attention mechanism by showing reductions from OVP.
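As a point of reference for the reductions that follow, the brute-force approach to OVP simply tests all n^2 pairs. The sketch below (ours, for illustration only) runs in O(n^2 d) time, which is exactly the barrier that the SETH-based hardness of OVP says cannot be beaten by polynomial factors in n.

```python
import numpy as np

def has_orthogonal_pair(A, B):
    """Brute-force OVP: A, B are (n, d) 0/1 arrays; returns True iff some a^T b = 0."""
    for a in A:                      # n iterations
        # B @ a computes all n dot products a^T b_j in O(n d) time.
        if np.any(B @ a == 0):
            return True
    return False

# Small example: a_2 = (0,1,0) is orthogonal to b_1 = (1,0,1).
A = np.array([[1, 1, 0], [0, 1, 0]])
B = np.array([[1, 0, 1], [0, 1, 1]])
assert has_orthogonal_pair(A, B)
```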
Definition 2 (Threshold Vectors Product Problem (TVPP)) Two sets with equal cardinality A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} are given, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The problem decides if there exists at least one pair a ∈ A and b ∈ B such that a^T b ≥ t for a given t ∈ [d].
Definition 3 (Bichromatic Hamming Close Pair Problem (BHCP)) Two sets with equal cardi-
nality A = {a1 , . . . , an } and B = {b1 , . . . , bn } are given, where ai , bi ∈ {0, 1}d are binary vectors
for all i ∈ [n]. The problem decides if there exists at least one pair of vectors a ∈ A and b ∈ B such
that ∥a − b∥2 < t.
Both TVPP and BHCP can be shown to be SETH-hard via reductions from OVP (Definition 1), as shown in Lemmas 1, 2, and 3. The hardness of BHCP is an established result (Backurs et al., 2017), but we provide an improved reduction from OVP to BHCP via a new problem: the Bichromatic Hamming Far Pair (BHFP) problem. In contrast to established reductions from OVP, we avoid the additional factors of d (the dimension of the binary vectors) incurred during the reduction.
Definition 4 (Bichromatic Hamming Far Pair Problem (BHFP)) Two sets with equal cardinality A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} are given, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The problem decides if there exists at least one pair of vectors a ∈ A and b ∈ B such that ∥a − b∥_2 ≥ t.
Before discussing the hardness of computing self-attention, we state the hardness guarantees of
TVPP, BHFP, and BHCP formally, with proofs deferred to Appendix A.
Lemma 1 Assume SETH. Then for any ϵ > 0, the computational complexity of TVPP is Ω(n2−ϵ )
for d = ω(log n).
Lemma 2 Assume SETH. Then for any ϵ > 0, the computational complexity of BHFP is Ω(n2−ϵ )
for d = ω(log n).
Lemma 3 Assume SETH. Then for any ϵ > 0, the computational complexity of BHCP is Ω(n2−ϵ )
for d = ω(log n).
Below, we show a series of reductions from TVPP and BHCP to several well-studied self-attention mechanisms. We mainly modify the functions f(·) and h(·) (see Eq. 2) to reflect each attention mechanism in our arguments. For ease of exposition, we ignore the scaling factor 1/√d_k when computing the function f(·) in dot-product self-attention, since it only requires scaling every element of Q or K by 1/√d_k, which takes O(n d_k) time. We also note that several approaches for efficient transformers were developed for the particular case Q = K (Kitaev et al., 2020); our hardness results are stated for this specific instance where Q = K, and by a direct reduction the hardness guarantees hold for general (Q, K) as well. Finally, all of the arguments remain valid for multi-head self-attention, as proved in Appendix F.
BHCP vector gadgets. Let A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} be the sets given in BHCP, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. We construct our vector gadgets in the following way. First, create the matrix Q ∈ R^{2n×d} with rows Q_i = a_i for all i ∈ [n] and Q_{n+j} = b_j for all j ∈ [n]. Then create the matrix V ∈ R^{2n×1} by setting the first n elements to 0 and the last n elements to 1.
The entire construction of the vector gadgets (including multiplying each a_i, b_i by the constant C) takes O(nd) time. We use the above gadgets in the following reductions.
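The following sketch (ours, for illustration) builds the Q and V gadgets exactly as described above: the a_i and b_j vectors are stacked into Q, and V selects only the b-block of each row of the score matrix. Where exactly the temperature constant C from the proofs is folded in is left to the constructions in Section 4.2 and is omitted here.

```python
import numpy as np

def bhcp_gadgets(A, B):
    """Stack the BHCP sets into the attention inputs described above.

    A, B: (n, d) 0/1 arrays. Returns Q of shape (2n, d) whose first n rows are
    the a_i and last n rows are the b_j, and V of shape (2n, 1) whose first n
    entries are 0 and last n entries are 1, so that Y = h(S) V aggregates only
    the a_i-to-b_j attention scores. The constant C used in the proofs is omitted.
    """
    n, d = A.shape
    Q = np.vstack([A, B]).astype(float)                       # Q_i = a_i, Q_{n+j} = b_j
    V = np.concatenate([np.zeros(n), np.ones(n)]).reshape(2 * n, 1)
    return Q, V     # the whole construction takes O(nd) time
```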
Proof Suppose that the matrices Q and V are constructed as TVPP vector gadgets described above
in Section 4.2. With this, we have Y = h(S) · V ∈ Rn×1 . Since Y is a vector, we slightly abuse
notation and define Yi as the ith element of Y . Now consider the first n elements of Y . Since
Y = SV , we have that
$$Y_i = \sum_{j=1}^{n} e^{C a_i^T b_j} \quad \text{for any } i \in [n].$$
Now we check the magnitude of Yi , i ∈ [n] in order to distinguish between true and false cases
in TVPP. It takes O(n) time to check each Yi , i ∈ [n]. First we focus on the exact computation.
Consider the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [n], we have a_i^T b_j ≤ t − 1 and e^{C a_i^T b_j} ≤ e^{C(t−1)}. Then for all l ∈ [n], Y_l ≤ n e^{C(t−1)} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have a_i^T b_j ≥ t and e^{C a_i^T b_j} ≥ e^{Ct}. Thus for some l ∈ [n], we have Y_l ≥ e^{Ct} := ∆.
To distinguish between the two cases, it is sufficient to have ∆ > δ, i.e., e^{Ct} > n e^{C(t−1)}, which is equivalent to e^C > n; this holds when C = 2 log n.
Now let us consider multiplicative approximation error. With a µ-multiplicative factor, if there are no pairs a ∈ A and b ∈ B with a^T b ≥ t, then we have for all l ∈ [n], Ŷ_l ≤ (1 + µ)Y_l ≤ (1 + µ) n e^{C(t−1)} := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for some l ∈ [n], we have Ŷ_l ≥ (1 − µ)Y_l ≥ (1 − µ) e^{Ct} := ∆̂. In order to distinguish between the two cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 \log\!\left(\frac{1+µ}{1−µ}\, n\right).
Finally, we look at additive approximation error. With a µ-additive factor, if there are no pairs a ∈ A and b ∈ B with a^T b ≥ t, then we have for all l ∈ [n], Ŷ_l ≤ Y_l + µ ≤ n e^{C(t−1)} + µ := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for some l ∈ [n], we have Ŷ_l ≥ Y_l − µ ≥ e^{Ct} − µ := ∆̂, and it is again sufficient to have ∆̂ > δ̂, which holds with C = 2 log(n + 2µ). Thus, an O(n^{2−ϵ})-time algorithm for computing self-attention exactly, or up to multiplicative or additive error, would decide TVPP in O(n^{2−ϵ}) time, contradicting Lemma 1.
The above proof is for computing the output of self-attention. As an easy consequence, we can also show that computing the self-attention score matrix S, with either exact or element-wise multiplicative/additive error, requires quadratic time conditioned on SETH. The argument follows a proof similar to that of Theorem 4, and is given in detail in Appendix C.
Proof The proof is technically similar to the one above. Suppose that the matrices Q and V are constructed according to the TVPP vector gadgets described in Section 4.2. With this we have Y = h(S) · V ∈ R^{n×1}. Consider the first n elements of Y. Since h(·) acts as the row-wise softmax function, we have
$$Y_i = \sum_{j=n+1}^{2n} S_{ij} = \sum_{j=1}^{n} \frac{e^{C a_i^T b_j}}{\sum_{k=1}^{n} e^{C a_i^T a_k} + \sum_{k=1}^{n} e^{C a_i^T b_k}} = \frac{\sum_{j=1}^{n} e^{C a_i^T b_j}}{\sum_{j=1}^{n} e^{C a_i^T a_j} + \sum_{j=1}^{n} e^{C a_i^T b_j}}.$$
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [n], we have a_i^T b_j ≤ t − 1, so \sum_{j=1}^{n} e^{C a_i^T b_j} ≤ n e^{C(t−1)}. For a function x/(x+y), the maximum value is achieved at the maximum x and minimum y values. Thus, for all l ∈ [n], Y_l ≤ n e^{C(t−1)}/(n e^{C(t−1)} + n) = e^{C(t−1)}/(e^{C(t−1)} + 1) := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have a_i^T b_j ≥ t and e^{C a_i^T b_j} ≥ e^{Ct}. Then the row sum corresponding to that i, j pair is \sum_{j=1}^{n} e^{C a_i^T b_j} ≥ e^{Ct} + (n − 1)e^0 = e^{Ct} + (n − 1). For a function x/(x+y), the minimum value is achieved at the minimum x and maximum y values. Thus, for some l ∈ [n], we have
$$Y_l \ge \frac{e^{Ct} + (n − 1)}{e^{Ct} + (n − 1) + n e^{d}} := ∆.$$
In order to distinguish between the two cases, it is sufficient to have ∆ > δ, which requires e^{Ct} + (n − 1) > n e^{d} e^{C(t−1)}. This holds with C = log n + d.
Next, consider multiplicative error approximation. Select the same TVPP vector gadgets, except that this time the matrix V ∈ R^{2n×1} has its first n elements set to 1 and its last n elements set to 0. Since |Ŷ_i − Y_i| ≤ µ|Y_i| for all i ∈ [2n], we have (1 − µ)Y_i ≤ Ŷ_i ≤ (1 + µ)Y_i. Now consider the values of Ŷ_i, i ∈ [n], in the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [n], we have a_i^T b_j ≤ t − 1. This means that \sum_{j=1}^{n} e^{C a_i^T b_j} ≤ n e^{C(t−1)}. For a function x/(x+y), the minimum value is achieved at the minimum x and maximum y values. Thus, for all l ∈ [n], Y_l ≥ n/(n + n e^{C(t−1)}) = 1/(e^{C(t−1)} + 1), which means that for all l ∈ [n], we have Ŷ_l ≥ (1 − µ)Y_l ≥ (1 − µ) \cdot \frac{1}{e^{C(t−1)} + 1} := ∆.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have a_i^T b_j ≥ t. Then the row sum corresponding to that i, j pair is \sum_{j=1}^{n} e^{C a_i^T b_j} ≥ e^{Ct} + (n − 1)e^0 = e^{Ct} + (n − 1). For a function x/(x+y), the maximum is achieved at the maximum x and minimum y values. Thus, for some l ∈ [n], we have Y_l ≤ \frac{n e^{d}}{e^{Ct} + (n − 1) + n e^{d}}, which means that for some l ∈ [n], we have Ŷ_l ≤ (1 + µ)Y_l ≤ (1 + µ) \frac{n e^{d}}{e^{Ct} + (n − 1) + n e^{d}} := δ.
In order to distinguish between the two cases, it is sufficient to have ∆ > δ. This holds with C = \log\!\left(\frac{2(1+µ)}{1−µ}\, n\right) + d. Thus, if there is an algorithm for computing self-attention up to an element-wise multiplicative error µ that runs in O(n^{2−ϵ}) time, this entire process takes at most O(n + nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists a pair of vectors a ∈ A, b ∈ B such that a^T b ≥ t in O(n^{2−ϵ}) time, contradicting TVPP hardness.
Remark. For the multiplicative error approximations, we can set µ to a value arbitrarily close to 1 with a dependence on n, i.e., µ = 1 − Θ(1/n^x) for a sufficiently large (constant) order x. For example, with x = 2, we require C > 2 \log\!\left(\frac{2 − Θ(1/n^2)}{Θ(1/n^2)}\, n\right) ≈ 2[\log 2 + 3 \log(Θ(n))]. Therefore, to distinguish between the two cases with a µ-multiplicative error approximation with µ = 1 − Θ(1/n^2), we only require C = O(log n). We discuss this matter in more detail in Appendix H.
Also, for any matrix M ∈ R^{n×n}, let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention. Then for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following conditions requires Ω((nw)^{1−ϵ}) time when d_q = ω(log w) and w = ω(d_q).
1. Ŷ = Y (exact).
2. |Ŷij − Yij | ≤ µ|Yij | for all i ∈ [n] and j ∈ [dv ] where 0 ≤ µ < 1 (multiplicative approximation).
3. |Ŷij − Yij | ≤ µ for all i ∈ [n] and j ∈ [dv ] where 0 ≤ µ (additive approximation).
Proof See Appendix B for a full proof. First, a TVPP instance with sets of size k = √(nw) is constructed. Then sliding-window self-attention is computed for all the points in order, so that for each TVPP instance, all the pairs appear within the window. An appropriate value matrix V is constructed to select these pairs. If the overall self-attention can be calculated in O((nw)^{1−ϵ}) time, then the TVPP instance can be solved in O(k^{2−ϵ}) time, which contradicts Lemma 1.
Proof (Sketch). The Q and V matrices are constructed according to the BHCP vector gadgets described
in Section 4.2. With these, we define Y = h(S) · V ∈ Rn×1 . Considering the first n elements of
Y and a suitable selection of C, our goal is to make the two cases (whether there is a solution for
BHCP or not) distinguishable. Thus, if there is an algorithm for computing self-attention up to an
element-wise multiplicative error µ that runs in O(n2−ϵ ) time, this algorithm decides if there exists a
solution or not for BHCP in O(n2−ϵ ) time, which contradicts BHCP hardness (Lemma 3).
linear time in n. Once we have the denominator for each row, h(S) · V can be computed by dividing
each row of SV by the corresponding denominator.
Lemma 8 Let p be an integer ≥ 0 and let S_{ij} = C · (Q_i^T K_j)^p, where i, j ∈ [n] and C is a constant. Then SV can be computed in O(n d_q^p d_v) time.
Lemma 9 Let p be an integer ≥ 0. Then for all i ∈ [n], the sum \sum_{\hat{j}=1}^{n} C · (Q_i^T K_{\hat{j}})^p, where C is a constant, can be computed in O(n d_q^p) time.
Proof sketch of Lemmas 8 and 9. When f(Q_i, K_j) = Q_i^T K_j, we directly multiply the matrices K^T and V to obtain K^T V in O(n d_q d_v) time; then multiplying Q by K^T V in O(n d_q d_v) time gives the desired matrix SV. The sum of a row of S can be computed simply by storing the sum of the vectors K_j, j ∈ [n], in memory in O(n d_q) time and reusing it for each row i to compute \sum_{j=1}^{n} Q_i^T K_j = Q_i^T \sum_{j=1}^{n} K_j in O(n d_q) time. This idea can be extended to (Q_i^T K_j)^p.
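The following is our minimal illustration of the associativity trick behind Lemmas 8 and 9 for the case p = 1 (with C = 1): computing K^T V first means the n × n matrix S is never materialized. The extension to higher p sketched above would replace each row of Q and K by its degree-p monomial feature map, which blows the feature dimension up to d_q^p.

```python
import numpy as np

def linear_time_SV_p1(Q, K, V):
    """Compute S V for S_ij = Q_i^T K_j without forming S (Lemma 8, p = 1)."""
    # K^T V is d_q x d_v and costs O(n d_q d_v); Q @ (K^T V) costs O(n d_q d_v).
    return Q @ (K.T @ V)

def linear_time_row_sums_p1(Q, K):
    """Compute s_i = sum_j Q_i^T K_j for all i without an n^2 loop (Lemma 9, p = 1)."""
    k_sum = K.sum(axis=0)          # sum of the K_j vectors, O(n d_q)
    return Q @ k_sum               # Q_i^T (sum_j K_j) for every i, O(n d_q)

# Sanity check against the quadratic computation on a tiny instance.
rng = np.random.default_rng(1)
Q, K, V = rng.standard_normal((5, 3)), rng.standard_normal((5, 3)), rng.standard_normal((5, 2))
S = Q @ K.T
assert np.allclose(S @ V, linear_time_SV_p1(Q, K, V))
assert np.allclose(S.sum(axis=1), linear_time_row_sums_p1(Q, K))
```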
Proof The result is implied by Lemma 8 and Lemma 9. Let x_{ij} = Q_i^T K_j for i, j ∈ [n]. Define the polynomial function of order p as \sum_{z=0}^{p} c_z x_{ij}^z, where c_z, z ∈ {0, . . . , p}, are constants. Let S^{(0)}, . . . , S^{(p)} ∈ R^{n×n}, where S^{(z)}_{ij} = c_z x_{ij}^z. Now we can write S_{ij} = \sum_{z=0}^{p} S^{(z)}_{ij}, and thus SV = \sum_{z=0}^{p} S^{(z)} V. By Lemma 8, each term of this summation can be computed in O(n d_q^z d_v) time. Therefore, the overall time complexity of computing SV is O(n d_q^p d_v) (since the largest exponent of d_q occurs when z = p).
Let s_i be the sum of the elements of the i-th row of S. What remains is computing h(S) · V. This can be computed indirectly by first computing SV and then dividing each row i of SV by s_i. If we have precomputed each s_i, i ∈ [n], then this step takes O(n d_v) time, since we divide each element of SV (∈ R^{n×d_v}) by a scalar. What remains to show is that computing s_i for all i ∈ [n] takes O(n p d_q^p) time. Observe that
$$s_i = \sum_{\hat{j}=1}^{n} S_{i\hat{j}} = \sum_{\hat{j}=1}^{n} \sum_{z=0}^{p} S^{(z)}_{i\hat{j}} = \sum_{z=0}^{p} \sum_{\hat{j}=1}^{n} S^{(z)}_{i\hat{j}} = \sum_{z=0}^{p} \sum_{\hat{j}=1}^{n} c_z x_{i\hat{j}}^{z}.$$
From Lemma 9, each term of the outer summation over z can be computed in O(n d_q^z) time for all i ∈ [n], and the overall time complexity is O(n p d_q^p) (again taking the largest exponent of d_q).
Corollary 11 Let p be a non-negative integer. The p-th order polynomial approximation of dot-product softmax self-attention using matrices Q, K, V with a finite Taylor series can be computed in O(n d_q^p d_v) time.
Proof The result follows from setting the constants c_z = 1/z! for z = 0, . . . , p in Theorem 10.
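Below is our sketch of one way Corollary 11 could be realized (it is not code from the paper): each degree-z term of the Taylor series e^x ≈ \sum_{z ≤ p} x^z/z! is expanded into an explicit monomial feature map of dimension d_q^z, so the score matrix is never formed and the exponential dependence on p appears directly in the feature dimension. The approximation is only meaningful when the truncated series keeps the row sums positive, e.g., for scores of moderate magnitude.

```python
import numpy as np
from math import factorial

def monomial_features(X, z):
    """Degree-z monomial feature map: row i becomes the flattened z-fold outer
    product of X[i], so phi_z(x)^T phi_z(y) = (x^T y)^z. Dimension grows as d**z."""
    n, d = X.shape
    Phi = np.ones((n, 1))
    for _ in range(z):
        # Outer product with the original row, flattened: (n, d^k) -> (n, d^(k+1)).
        Phi = np.einsum('ij,ik->ijk', Phi, X).reshape(n, -1)
    return Phi

def taylor_softmax_attention(Q, K, V, p):
    """Linear-in-n approximation of softmax(Q K^T) V via a p-th order Taylor series.
    Runs in O(n d_q^p d_v) time and never forms the n x n score matrix."""
    n, d_v = V.shape
    num = np.zeros((n, d_v))       # accumulates S V
    den = np.zeros(n)              # accumulates the row sums s_i
    for z in range(p + 1):
        c = 1.0 / factorial(z)
        Pq, Pk = monomial_features(Q, z), monomial_features(K, z)
        num += c * (Pq @ (Pk.T @ V))          # Lemma 8: S^{(z)} V without S^{(z)}
        den += c * (Pq @ Pk.sum(axis=0))      # Lemma 9: row sums of S^{(z)}
    return num / den[:, None]

# Usage: larger p tracks softmax attention more closely for small scores.
rng = np.random.default_rng(2)
Q = rng.standard_normal((64, 4)) * 0.3
K = rng.standard_normal((64, 4)) * 0.3
V = rng.standard_normal((64, 2))
Y_approx = taylor_softmax_attention(Q, K, V, p=3)
```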
6. Conclusions
In this paper we investigate fundamental bounds on the computational complexity of self-attention.
We examine numerous state-of-the-art self-attention mechanisms, and prove quadratic (or rectangular)
lower bounds assuming the Strong Exponential Time Hypothesis (SETH). Even though a large
number of recent works have proposed fast approximations to self-attention, our results imply that it may be difficult to overcome the quadratic runtime barrier while still retaining high accuracy. On the positive side, we show that linear-time computation is possible if we choose the score computation function in the form of a polynomial.
Our work leaves open several directions. At a high level, our theorems establish a separation between ‘exponential’ and ‘polynomial’ forms of self-attention, but a clearer picture of the landscape would be helpful. Moreover, our results are for worst-case inputs; establishing similar hardness results for average-case inputs is an interesting direction. Finally, we leave the door open for the possibility of randomized algorithms that achieve sub-quadratic complexity and are correct with high probability.
Acknowledgments
This work was supported in part by the National Science Foundation (under grants CCF-2005804 and
CCF-1801495) and USDA/NIFA (under grant 2021-67021-35329).
References
Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In 2015 IEEE 56th Annual Symposium on Foundations of
Computer Science, pages 59–78. IEEE, 2015.
Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. If the current clique algorithms
are optimal, so is Valiant's parser. SIAM Journal on Computing, 47(6):2527–2555, 2018.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel
Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language
model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time
(unless seth is false). In Proceedings of the forty-seventh annual ACM symposium on Theory of
computing, pages 51–58, 2015.
Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk
minimization: Kernel methods and neural networks. Advances in Neural Information Processing
Systems, 30, 2017.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
CoRR, abs/2004.05150, 2020.
Karl Bringmann. Why walking the dog takes time: Frechet distance has no strongly subquadratic
algorithms unless SETH fails. In 2014 IEEE 55th Annual Symposium on Foundations of Computer
Science, pages 661–670. IEEE, 2014.
Karl Bringmann. Fine-grained complexity theory: Conditional lower bounds for computational
geometry. In Conference on Computability in Europe, pages 60–70. Springer, 2021.
Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems
and dynamic time warping. In 2015 IEEE 56th Annual Symposium on Foundations of Computer
Science, pages 79–97. IEEE, 2015.
Karl Bringmann and André Nusser. Translating Hausdorff is hard: Fine-grained lower bounds for Hausdorff distance under translation. In 37th International Symposium on Computational
Geometry, 2021.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. CoRR, 2021a.
Yifan Chen, Qi Zeng, Heng Ji, and Yun Yang. Skyformer: Remodel self-attention with Gaussian kernel and Nyström method. Advances in Neural Information Processing Systems, 34, 2021b.
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea
Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al.
Rethinking attention with performers. In International Conference on Learning Representations,
2021.
Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure
attention loses rank doubly exponentially with depth. In International Conference on Machine
Learning, pages 2793–2803. PMLR, 2021.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image
is worth 16x16 words: Transformers for image recognition at scale. In International Conference
on Learning Representations, 2021.
Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and
System Sciences, 62(2):367–375, 2001. ISSN 0022-0000.
Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponen-
tial complexity? Journal of Computer and System Sciences, 63(4):512–530, 2001.
Piotr Indyk. Beyond P vs. NP: quadratic-time hardness for big data problems. In Proceedings of the
29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 1–1, 2017.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger,
Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate
protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs:
Fast autoregressive transformers with linear attention. In International Conference on Machine
Learning, pages 5156–5165. PMLR, 2020.
Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages
4171–4186, 2019.
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and
Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 2021.
Hyunjik Kim, George Papamakarios, and Andriy Mnih. The Lipschitz constant of self-attention. In
International Conference on Machine Learning, pages 5562–5571. PMLR, 2021.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In
International Conference on Learning Representations, 2020.
Valerii Likhosherstov, Krzysztof M Choromanski, Jared Quincy Davis, Xingyou Song, and Adrian
Weller. Sub-linear memory: How to make performers slim. Advances in Neural Information
Processing Systems, 34, 2021.
Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing XU, Tao
Xiang, and Li Zhang. SOFT: Softmax-free transformer with linear complexity. In M. Ranzato,
A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural
Information Processing Systems, volume 34, pages 21297–21309. Curran Associates, Inc., 2021.
Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong.
Random feature attention. In International Conference on Learning Representations, 2021.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International Conference on Machine Learning,
pages 8748–8763. PMLR, 2021.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse
attention with routing transformers. Transactions of the Association for Computational Linguistics,
9:53–68, 2021.
Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th
annual ACM SIGACT symposium on theory of computing, pages 1260–1268, 2018.
Aviad Rubinstein and Virginia Vassilevska Williams. SETH vs approximation. ACM SIGACT News,
50(4):57–76, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing
systems, 30, 2017.
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with
linear complexity. arXiv preprint arXiv:2006.04768, 2020.
Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical
Computer Science, 348(2):357–365, 2005.
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and
Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14138–14148,
2021.
Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are
transformers universal approximators of sequence-to-sequence functions? In International
Conference on Learning Representations, 2019.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for
longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
Lemma 2. Assume SETH. Then for any ϵ > 0, the computational complexity of BHFP is Ω(n2−ϵ )
for d = ω(log n).
Proof Consider the two sets A = {a1 , . . . , an } and B = {b1 , . . . , bn } given in OVP, where
ai , bi ∈ {0, 1}d are binary vectors for all i ∈ [n].
For any vector a_i ∈ A, let us define ā_i ∈ R^{3d} as the concatenation of the vector a_i, the vector (1 − a_i), and the vector 0, where 1 := (1, 1, . . . , 1) ∈ R^d and 0 := (0, 0, . . . , 0) ∈ R^d. Then we have ā_i^T ā_i = a_i^T a_i + (1 − a_i)^T(1 − a_i) = d. Now define Ā as the set of the ā_i. For any vector b_j ∈ B, let us define b̄_j ∈ R^{3d} as the concatenation of the vector b_j, the vector 0, and the vector (1 − b_j). Then we have b̄_j^T b̄_j = b_j^T b_j + (1 − b_j)^T(1 − b_j) = d. Now define B̄ as the set of the b̄_j. Because the overall dimensions of A, B and Ā, B̄ are nd and 3nd respectively, this process takes O(nd) time. Now for any i, j ∈ [n] we have
$$\bar{a}_i^T \bar{b}_j = a_i^T b_j + (1 − a_i)^T 0 + 0^T (1 − b_j) = a_i^T b_j.$$
The squared ℓ2 distance between ā_i and b̄_j for any i, j ∈ [n] is
$$\|\bar{a}_i − \bar{b}_j\|_2^2 = (\bar{a}_i − \bar{b}_j)^T(\bar{a}_i − \bar{b}_j) = \bar{a}_i^T \bar{a}_i + \bar{b}_j^T \bar{b}_j − 2\bar{a}_i^T \bar{b}_j = d + d − 2\bar{a}_i^T \bar{b}_j = 2d − 2 a_i^T b_j.$$
Hence ∥ā_i − b̄_j∥_2^2 = 2d if and only if a_i^T b_j = 0. Now if we run the algorithm for BHFP with the threshold t = √(2d) on the sets Ā and B̄ to find a pair ā ∈ Ā and b̄ ∈ B̄ that satisfy ∥ā − b̄∥_2 ≥ √(2d), then we can conclude that there is a pair a ∈ A and b ∈ B that satisfies a^T b = 0.
Thus, if there is an algorithm for BHFP that runs in O(n^{2−ϵ}) time, this entire process takes at most O(nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists at least one pair of vectors a ∈ A and b ∈ B such that a^T b = 0 in O(n^{2−ϵ}) time, which contradicts the hardness result of OVP. As a result, for at least one value of t, the BHFP problem cannot be solved in O(n^{2−ϵ}) time.
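As a small self-contained illustration (ours), the sketch below builds the padding gadget from the proof of Lemma 2 and checks the distance identity it relies on; the function name is hypothetical.

```python
import numpy as np

def bhfp_gadgets(A, B):
    """Pad the OVP sets as in the proof of Lemma 2.

    A, B: (n, d) 0/1 arrays.  Returns Abar, Bbar of shape (n, 3d) with
    abar_i = [a_i, 1 - a_i, 0] and bbar_j = [b_j, 0, 1 - b_j], so that
    ||abar_i - bbar_j||^2 = 2d - 2 a_i^T b_j  (= 2d iff a_i^T b_j = 0).
    """
    n, d = A.shape
    zeros = np.zeros((n, d), dtype=A.dtype)
    Abar = np.hstack([A, 1 - A, zeros])
    Bbar = np.hstack([B, zeros, 1 - B])
    return Abar, Bbar

# Check the distance identity on a random instance.
rng = np.random.default_rng(3)
A = rng.integers(0, 2, size=(4, 6))
B = rng.integers(0, 2, size=(4, 6))
Abar, Bbar = bhfp_gadgets(A, B)
d = A.shape[1]
dist2 = ((Abar[:, None, :] - Bbar[None, :, :]) ** 2).sum(-1)   # all pairwise squared distances
assert np.allclose(dist2, 2 * d - 2 * (A @ B.T))
```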
Lemma 3. Assume SETH. Then for any ϵ > 0, the computational complexity of BHCP is Ω(n2−ϵ )
for d = ω(log n).
Proof Consider the two sets A = {a1 , . . . , an } and B = {b1 , . . . , bn } given in BHFP, where
ai , bi ∈ {0, 1}d are binary vectors for all i ∈ [n]. Let t, t′ be the given threshold for BHFP and
BHCP respectively.
For any vector bj ∈ B, let us define b¯j ∈ Rd as (1 − bj ), where 1 := (1, 1, . . . , 1) ∈ Rd . Now
define B̄ as the set of b¯j . Because the overall dimension of B and B̄ is nd, this process takes O(nd)
time. The squared ℓ2 distance between ai and b¯j for any i, j ∈ [n] is
the hardness result of BHFP. As a result, for at least one value of t′, the BHCP problem cannot be solved in O(n^{2−ϵ}) time.
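The middle portion of this proof falls on a page not reproduced above. The sketch below (ours) only checks the complement identity that the construction b̄_j = 1 − b_j sets up, namely that for binary vectors ∥a − (1 − b)∥^2 = d − ∥a − b∥^2, so far pairs of (A, B) become close pairs of (A, B̄); the exact threshold mapping t ↦ t′ is left to the full proof.

```python
import numpy as np

def bhcp_from_bhfp(B):
    """Complement gadget from the proof of Lemma 3: bbar_j = 1 - b_j."""
    return 1 - B

# For 0/1 vectors, ||a - (1 - b)||^2 = d - ||a - b||^2; verify on a random instance.
rng = np.random.default_rng(4)
A = rng.integers(0, 2, size=(5, 7))
B = rng.integers(0, 2, size=(5, 7))
Bbar = bhcp_from_bhfp(B)
d = A.shape[1]
dist_AB = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
dist_ABbar = ((A[:, None, :] - Bbar[None, :, :]) ** 2).sum(-1)
assert np.allclose(dist_ABbar, d - dist_AB)
```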
Also, for any matrix M ∈ R^{n×n}, let h(M) = M. Let Y = h(S) · V ∈ R^{n×d_v} be a self-attention. Then for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies any of the following conditions requires Ω((nw)^{1−ϵ}) time when d_q = ω(log √(nw)) and w = ω(d_q).
1. Ŷ = Y (exact).
2. |Ŷ_{ij} − Y_{ij}| ≤ µ|Y_{ij}| for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ < 1 (multiplicative approximation).
3. |Ŷ_{ij} − Y_{ij}| ≤ µ for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ (additive approximation).
Proof Consider the two sets A = {a_1, . . . , a_k} and B = {b_1, . . . , b_k} given in TVPP, where k = √(nw) and a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [k]. The problem is to decide if there exists at least one pair a ∈ A and b ∈ B that satisfies a^T b ≥ t for a given t ∈ [d].
We construct our matrices Q and V in the following way. First, define (α (mod k)) := m if α ≡ m (mod k) and m ∈ [k]. Create the matrix Q ∈ R^{3n×d} with its even-indexed rows as Q_{2α} = C b_{(α (mod k))} and its odd-indexed rows as Q_{2α−1} = a_{(α (mod k)) + w⌊(2α−1)/(2k)⌋}. This process takes O(nd) time. By this construction, each (a_i, b_j) pair appears within distance w.
Also, select V to be the alternating vector (1, 0, . . . , 1, 0) ∈ R^{3n}.
With this we have Y = h(S) · V ∈ R3n×1 . Since Y is a vector, we abuse the notation again and
define Yi as the ith element of Y .
Now consider the first n even-indexed rows of Y. Because the odd rows of Q come from set A and the even rows of Q come from set B, by the definition of the vector V each even-indexed entry of Y is the sum of the exponentials of C times the dot products of w different (a_i, b_j) pairs. Moreover, since each (a_i, b_j) pair appears within distance w, the exponential of C times the dot product of every pair appears in the sliding-window attention score matrix and contributes to Y_{2l} for at least one value l ∈ [n].
Now we check the magnitudes of Y2l for l ∈ [n] in order to distinguish between true and false
cases in TVPP. It takes O(n) time to check these n values. First, we focus on the exact computation.
Consider the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [k], we have a_i^T b_j ≤ t − 1 and e^{C a_i^T b_j} ≤ e^{C(t−1)}. Then for all l ∈ [n], we have Y_{2l} ≤ w e^{C(t−1)} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [k], we have a_i^T b_j ≥ t and e^{C a_i^T b_j} ≥ e^{Ct}. Because every pair appears in some element of Y, we have Y_{2l} ≥ e^{Ct} := ∆ for some l ∈ [n].
In order to distinguish between two cases, it is sufficient to have ∆ > δ. This holds with
C = 2 log w.
Now let us consider multiplicative approximation error. With a µ-multiplicative factor, if there are no pairs a ∈ A and b ∈ B with a^T b ≥ t, then we have for all l ∈ [n], Ŷ_{2l} ≤ (1 + µ)Y_{2l} ≤ (1 + µ) w e^{C(t−1)} := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for some l ∈ [n], we have Ŷ_{2l} ≥ (1 − µ)Y_{2l} ≥ (1 − µ) e^{Ct} := ∆̂. In order to distinguish between the two cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 \log\!\left(\frac{1+µ}{1−µ}\, w\right).
Finally, we look at additive approximation error. With a µ-additive factor, if there are no pairs a ∈ A and b ∈ B with a^T b ≥ t, then we have for all l ∈ [n], Ŷ_{2l} ≤ Y_{2l} + µ ≤ w e^{C(t−1)} + µ := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with a^T b ≥ t, then for some l ∈ [n], we have Ŷ_{2l} ≥ Y_{2l} − µ ≥ e^{Ct} − µ := ∆̂. In order to distinguish between the two cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 log(w + 2µ).
Thus, if there is an algorithm for computing self-attention up to an element-wise multiplicative or additive error µ that runs in O(k^{2−ϵ}) time, this entire process takes at most O(nd + k^{2−ϵ}) = O((√(nw))^{2−ϵ}) = O((nw)^{1−ϵ}) as long as d = o(w^{1−ϵ}). Therefore, this algorithm decides if there exists at least one pair of vectors a ∈ A and b ∈ B such that a^T b ≥ t in O((nw)^{1−ϵ}) time, which contradicts the hardness result of TVPP (Lemma 1). This completes the proof.
A similar proof also applies to the dilated sliding window (Beltagy et al., 2020), where the self-attention score is calculated as in Theorem 6. Also, when the self-attention score is the softmax dot product (where the softmax is applied only over the window in each row), one can prove an Ω((nw)^{1−ϵ}) lower bound by following the proof of Theorem 5.
1. |Ŝ_{ij} − S_{ij}| ≤ µ|S_{ij}| for all i, j ∈ [n], where 0 ≤ µ < 1 (multiplicative approximation).
Let Y = h(Ŝ) · V ∈ R^{n×d_v} be a self-attention. Then for any ϵ > 0, computing the self-attention Y ∈ R^{n×d_v} requires Ω(n^{2−ϵ}) time when d_q = ω(log n).
Proof Consider two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in TVPP, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The problem is to decide if there exists at least one pair a ∈ A and b ∈ B that satisfies a^T b ≥ t for a given t ∈ [d].
Suppose that the matrices Q and V are constructed according to the TVPP vector gadgets described in Section 4.2. With this we have Y = h(Ŝ) · V ∈ R^{n×1}. Since Y is a vector, we again abuse notation and define Y_i as the i-th element of Y. Now consider the first n elements of Y. Since Y = ŜV, we have that
$$Y_i = \sum_{j=1}^{n} \hat{S}_{ij} \quad \text{for any } i \in [n].$$
Now we check the magnitude of Yi , i ∈ [n] in order to distinguish between true and false cases
in TVPP. It takes O(n) time to check each Yi , i ∈ [n]. First we focus on the multiplicative error
approximation. Consider the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with a^T b ≥ t; that is, for all i, j ∈ [n], we have a_i^T b_j ≤ t − 1 and S_{ij} = e^{C a_i^T b_j} ≤ e^{C(t−1)}, so Ŝ_{ij} ≤ (1 + µ)S_{ij} = (1 + µ)e^{C a_i^T b_j} ≤ (1 + µ)e^{C(t−1)}. Then for all l ∈ [n], Y_l ≤ n(1 + µ)e^{C(t−1)} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have a_i^T b_j ≥ t and S_{ij} = e^{C a_i^T b_j} ≥ e^{Ct}, so Ŝ_{ij} ≥ (1 − µ)S_{ij} = (1 − µ)e^{C a_i^T b_j} ≥ (1 − µ)e^{Ct}. Then for some l ∈ [n], we have Y_l ≥ (1 − µ)e^{Ct} := ∆, and it is sufficient to have ∆ > δ, which holds with C = 2 \log\!\left(\frac{1+µ}{1−µ}\, n\right).
Similar proofs also show the quadratic complexity of multiplicative approximations to the score matrix S of softmax dot-product self-attention and of ℓ2 self-attention, directly from Theorem 5 and Theorem 7. Likewise, by Theorem 6, one can show the corresponding complexity of additive and multiplicative approximations to the S of sliding-window dot-product self-attention.
where x = Q_i, y = K_j, and x_r, y_r denote the r-th elements of the corresponding vectors. Given a vector v ∈ R^{d_q}, let us define a function α : R^{d_q} → R^{d_q^p} whose elements α(v) are computed
Lemma 5. Let p be an integer ≥ 0. Then for all i ∈ [n], the sum \sum_{\hat{j}=1}^{n} C · (Q_i^T K_{\hat{j}})^p, where C is a constant, can be computed in O(n d_q^p) time.
Proof Consider two sets A = {a_1, . . . , a_n} and B = {b_1, . . . , b_n} given in BHCP, where a_i, b_i ∈ {0, 1}^d are binary vectors for all i ∈ [n]. The problem is to decide if there exists at least one pair a ∈ A and b ∈ B that satisfies ∥a − b∥_2^2 < t for a given t ∈ [d].
Suppose that the matrices Q and V are constructed according to the BHCP vector gadgets described in Section 4.2. With this we have Y = h(S) · V ∈ R^{n×1}. Since Y is a vector, we again abuse notation and define Y_i as the i-th element of Y. Now consider the first n elements of Y. Since Y = SV, we have that
$$Y_i = \sum_{j=1}^{n} e^{−C\|a_i − b_j\|_2^2} \quad \text{for any } i \in [n].$$
Now we check the magnitude of Y_i, i ∈ [n], in order to distinguish between true and false cases
in BHCP. It takes O(n) time to check each Yi , i ∈ [n]. First we focus on the exact computation.
Consider the following two cases.
Case 1. There are no pairs a ∈ A and b ∈ B with ∥a − b∥_2^2 < t; that is, for all i, j ∈ [n], we have ∥a_i − b_j∥_2^2 ≥ t and e^{−C∥a_i − b_j∥_2^2} ≤ e^{−Ct}. Then for all l ∈ [n], Y_l ≤ n e^{−Ct} := δ.
Case 2. There is a pair a ∈ A and b ∈ B with ∥a − b∥_2^2 < t; that is, for some i, j ∈ [n], we have ∥a_i − b_j∥_2^2 ≤ t − 1 and e^{−C∥a_i − b_j∥_2^2} ≥ e^{−C(t−1)}. Thus for some l ∈ [n], we have Y_l ≥ e^{−C(t−1)} := ∆.
In order to distinguish between the two cases, it is sufficient to have ∆ > δ. This holds with C = 2 log n.
Now let us look at the multiplicative error approximation. With a µ-multiplicative factor, if there are no pairs a ∈ A and b ∈ B with ∥a − b∥_2^2 < t, then we have for all l ∈ [n], Ŷ_l ≤ (1 + µ)Y_l ≤ (1 + µ) n e^{−Ct} := δ̂. On the other hand, if there is a pair a ∈ A and b ∈ B with ∥a − b∥_2^2 < t, then for some l ∈ [n], we have Ŷ_l ≥ (1 − µ)Y_l ≥ (1 − µ) e^{−C(t−1)} := ∆̂. In order to distinguish between the two cases, it is sufficient to have ∆̂ > δ̂, and this inequality holds with C = 2 \log\!\left(\frac{1+µ}{1−µ}\, n\right).
Thus, if there is an algorithm for computing self-attention up to an element-wise multiplicative error µ that runs in O(n^{2−ϵ}) time, this entire process takes at most O(n + nd + n^{2−ϵ}) = O(n^{2−ϵ}) as long as d = o(n^{1−ϵ}), and this algorithm decides if there exists at least one pair of vectors a ∈ A and b ∈ B such that ∥a − b∥_2^2 < t in O(n^{2−ϵ}) time, which contradicts the hardness result of BHCP (Lemma 3). This completes the proof.
This lemma proves that the direct sum of the computational complexities of the OVP-type problems (TVPP, BHCP, and BHFP) is valid.
The following theorem shows that we cannot reach a better element-wise additive approximation than e^{−3d \log(n) − 3d^2} = n^{−3d} · e^{−3d^2} for the softmax dot-product self-attention. This element-wise additive approximation error is of order O(n^{−d}).
Theorem 6. Assume SETH. For any i, j ∈ [n], let S_{ij} = f(Q_i, Q_j) = e^{Q_i^T Q_j}, and for any matrix M ∈ R^{n×n}, let h(M) ∈ R^{n×n} be the matrix where for all i, j ∈ [n], {h(M)}_{ij} = M_{ij} / \sum_{k=1}^{n} M_{ik}. Let Y = h(S) · V ∈ R^{n×d_v} be self-attention. Then provided d_q = ω(log n), for any ϵ > 0, computing a matrix Ŷ ∈ R^{n×d_v} that satisfies the following condition requires Ω(n^{2−ϵ}) time:
|Ŷ_{ij} − Y_{ij}| ≤ µ for all i ∈ [n] and j ∈ [d_v], where 0 ≤ µ ≤ e^{−3d \log(n) − 3d^2} (additive approximation).
Proof The proof is technically similar. Suppose that the matrices Q and V are constructed according to the TVPP vector gadgets described in Section 4.2. With this we have Y = h(S) · V ∈ R^{n×1}. Consider the first n elements of Y. Since h(·) acts as the row-wise softmax function, we have
$$Y_i = \sum_{j=n+1}^{2n} S_{ij} = \sum_{j=1}^{n} \frac{e^{C a_i^T b_j}}{\sum_{k=1}^{n} e^{C a_i^T a_k} + \sum_{k=1}^{n} e^{C a_i^T b_k}} = \frac{\sum_{j=1}^{n} e^{C a_i^T b_j}}{\sum_{k=1}^{n} e^{C a_i^T a_k} + \sum_{k=1}^{n} e^{C a_i^T b_k}}.$$
Case 1. There are no pairs a ∈ A and b ∈ B satisfying a^T b ≥ t; that is, for all i, j ∈ [n], we have a_i^T b_j ≤ t − 1, and \sum_{j=1}^{n} e^{C a_i^T b_j} ≤ n e^{C(t−1)}. For a function x/(x+y), the maximum value is achieved at the maximum x and minimum y values. Thus, for all l ∈ [n],
$$Y_l \le \frac{n e^{C(t−1)}}{n e^{C(t−1)} + n} = \frac{e^{C(t−1)}}{e^{C(t−1)} + 1}.$$
It is given that |Ŷ_l − Y_l| ≤ µ, so
$$\hat{Y}_l \le Y_l + µ \le \frac{e^{C(t−1)}}{e^{C(t−1)} + 1} + e^{−3d \log(n) − 3d^2} =: δ.$$
Case 2. There is a pair a ∈ A and b ∈ B with a^T b ≥ t; that is, for some i, j ∈ [n], we have a_i^T b_j ≥ t. Then the row sum corresponding to that i, j pair is \sum_{j=1}^{n} e^{C a_i^T b_j} ≥ e^{Ct} + (n − 1)e^0 = e^{Ct} + n − 1. For a function x/(x+y), the minimum value is achieved at the minimum x and maximum y values. Thus, for some l ∈ [n], we have
$$Y_l \ge \frac{e^{Ct} + (n − 1)}{e^{Ct} + (n − 1) + n e^{d}}.$$
It is given that |Ŷ_l − Y_l| ≤ µ, so
$$\hat{Y}_l \ge Y_l − µ \ge \frac{e^{Ct} + (n − 1)}{e^{Ct} + (n − 1) + n e^{d}} − µ \ge \frac{e^{Ct} + (n − 1)}{e^{Ct} + (n − 1) + n e^{d}} − e^{−3d \log(n) − 3d^2} =: ∆.$$