Armin Biere, Marijn Heule, Hans van Maaren and Toby Walsh (Eds.)
IOS Press, 2008
© 2008 Dimitris Achlioptas. All rights reserved.
Chapter 8
Random Satisfiability
Dimitris Achlioptas
8.1. Introduction
Satisfiability has received a great deal of study as the canonical NP-complete prob-
lem. In the last twenty years a significant amount of this effort has been devoted
to the study of randomly generated satisfiability instances and the performance
of different algorithms on them. Historically, the motivation for studying random
instances has been the desire to understand the hardness of typical instances. In
fact, some early results suggested that deciding satisfiability is easy on average.
Unfortunately, while "easy" is easy to interpret, "on average" is not.
One of the earliest and most often quoted results for satisfiability being easy
on average is due to Goldberg [Gol79]. In [FP83], though, Franco and Paull
pointed out that the distribution of instances used in the analysis of [Gol79]
is so greatly dominated by easily satisfiable formulas that if one tries truth
assignments completely at random, the expected number of trials until finding a
satisfying one is O(1). Instead, Franco and Paull pioneered the analysis of
random instances of k-SAT, i.e., asking the satisfiability question for random k-CNF
formulas (defined precisely below). Among other things, they showed [FP83]
that for all k ≥ 3 the DPLL algorithm needs an exponential number of steps to
report all "cylinders" of solutions of such a formula, or that no solutions exist.
Random k-CNF formulas. Let F_k(n, m) denote a Boolean formula in Conjunctive
Normal Form with m clauses over n variables, where the clauses are chosen
uniformly, independently and without replacement among all 2^k · C(n, k) non-trivial
clauses of length k, i.e., clauses with k distinct, non-complementary literals
(here C(n, k) denotes the binomial coefficient).
Typically, k is a (small) fixed integer, while n is allowed to grow. In particular,
we will say that a sequence of random events E_n occurs with high probability
(w.h.p.) if lim_{n→∞} Pr[E_n] = 1 and with uniformly positive probability (w.u.p.p.)
if lim inf_{n→∞} Pr[E_n] > 0. One of the first facts to be established [FP83] for
random k-CNF formulas is that F_k(n, rn) is w.h.p. unsatisfiable for r ≥ 2^k ln 2.
A few years later, Chao and Franco [CF86, CF90] showed that, in contrast, if
r < 2^k/k, a very simple algorithm will find a satisfying truth assignment w.u.p.p.
Combined, these two facts established m = Θ(n) as the most interesting regime
for considering the satisfiability of random k-CNF formulas.
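The unsatisfiability side follows from a union (first moment) bound; the short derivation below is standard and recurs throughout the chapter (for convenience it treats the model where clauses are chosen with replacement, which does not affect the asymptotics). Since a fixed truth assignment satisfies a uniformly random k-clause with probability 1 - 2^{-k},

    E[#satisfying assignments of F_k(n, rn)] = 2^n (1 - 2^{-k})^{rn} = [ 2 (1 - 2^{-k})^r ]^n ,

and the bracketed quantity is smaller than 1 as soon as r > ln 2 / ln(1 - 2^{-k})^{-1} ≈ 2^k ln 2. For such r the expected number of satisfying assignments, and hence by Markov's inequality the probability of satisfiability, tends to 0 with n.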
The study of random k-CNF formulas took off in the late 1980s. In a celebrated
result, Chvátal and Szemerédi [CS88] proved that random k-CNF formulas
w.h.p. have exponential resolution complexity, implying that if F is a
random k-CNF formula with r ≥ 2^k ln 2, then w.h.p. every DPLL-type algorithm
needs exponential time to prove its unsatisfiability. A few years later,
Chvátal and Reed [CR92] proved that random k-CNF formulas are satisfiable
w.h.p. for r = O(2^k/k), strengthening the w.u.p.p. result of [CF90]. Arguably,
the biggest boost came from the experimental work of Mitchell, Selman and
Levesque [MSL92] and the analysis of these experiments by Kirkpatrick and Sel-
man in [KS94].
In particular, [SML96] gave extensive experimental evidence suggesting that
for k ≥ 3, there is a range of the clauses-to-variables ratio, r, within which it seems
hard even to decide if a randomly chosen k-SAT instance is satisfiable or not (as
opposed to reporting all "cylinders" of solutions or that no solutions exist). The
analysis of these experiments using finite-size scaling methods of statistical physics
in [KS94] showed that this peak in experimental decision complexity coincided
with a precipitous drop in the probability that a random formula is satisfiable.
For example, for k = 3 the analysis drew the following remarkable picture: for
r < 4, a satisfying truth assignment can be easily found for almost all formulas;
for r > 4.5, almost all formulas are unsatisfiable; for r ≈ 4.2, a satisfying truth
assignment can be found for roughly half the formulas and around this point
the computational effort for finding a satisfying truth assignment, whenever one
exists, is maximized.
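This picture is easy to reproduce qualitatively at very small scale. The sketch below is our own illustration, not the experimental setup of [SML96]: all names are ours, clauses are sampled with replacement for simplicity, and satisfiability is decided by brute force, so it is only feasible for small n.

import itertools, random

def random_k_cnf(n, m, k=3):
    """m clauses over variables 1..n, each with k distinct,
    non-complementary literals chosen uniformly at random."""
    clauses = []
    for _ in range(m):
        vs = random.sample(range(1, n + 1), k)
        clauses.append([v if random.random() < 0.5 else -v for v in vs])
    return clauses

def satisfiable(clauses, n):
    """Brute force over all 2^n assignments (tiny n only)."""
    for bits in itertools.product((False, True), repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

n, trials = 10, 50
for r in (3.0, 4.0, 4.2, 4.5, 5.0):
    sat = sum(satisfiable(random_k_cnf(n, int(r * n)), n) for _ in range(trials))
    print("r = %.1f: Pr[sat] ~ %.2f" % (r, sat / trials))

Even at this scale a drop in the probability of satisfiability is already visible around the conjectured threshold, though at such small n it is far from sharp.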
This potential connection between phase transitions and computational com-
plexity generated a lot of excitement. A particularly crisp question is whether the
probability of satisfiability indeed exhibits a sharp threshold around some critical
density (ratio of clauses to variables).
Satisfiability Threshold Conjecture. For every k ≥ 3, there exists a constant
r_k > 0 such that

    lim_{n→∞} Pr[F_k(n, rn) is satisfiable] = 1 if r < r_k, and 0 if r > r_k.
One reason for the sustained interest in random k-SAT is the potential connection,
just described, between phase transitions and computational properties. Another
is that random k-SAT instances remain computationally hard for
a certain range of densities, making them a popular benchmark for testing and
tuning satisfiability algorithms. In fact, some of the better practical ideas in use
today come from insights gained by studying the performance of algorithms on
random k-SAT instances [GSCK00].
The rest of this chapter is divided into two major pieces. The first piece is
an overview of the current state of the art regarding mathematical properties of
random k-CNF formulas where the number of clauses m = Θ(n). The second
piece concerns the performance of different algorithms on sparse random k-CNF
formulas. Both pieces are concerned with (and limited to) rigorous mathematical
results. For experimental algorithmic results we refer the reader to the survey by
Cook and Mitchell [CM97]. For results using the methods of statistical physics
we refer the reader to Part 1, Chapter 9.
The satisfiability threshold conjecture remains open. A big step was made by
Friedgut [Fri99] who showed the existence of a sharp threshold for satisfiability
around a critical sequence of densities (rather than a single density rk ).
Theorem 1 ([Fri99]). For every k ≥ 3, there exists a sequence r_k(n) such that
for any ε > 0,

    lim_{n→∞} Pr[F_k(n, m) is satisfiable] = 1 if m = (r_k(n) - ε)n, and 0 if m = (r_k(n) + ε)n.

The current best bounds on the location of the (conjectured) threshold are

    2^k ln 2 - Θ(k) ≤ r_k ≤ 2^k ln 2 - Θ(1) .
The precise form of these bounds is given by Theorem 3, which we illustrate for
some small values of k in Table 8.1.
Theorem 3. There exist sequences δ_k, ε_k → 0 such that for all k ≥ 3,

    2^k ln 2 - (k + 1)·(ln 2)/2 - 1 - δ_k ≤ r_k ≤ 2^k ln 2 - (1 + ln 2)/2 + ε_k .
1 We use the term presults as a way of referring to p[hysics] results in this area, the letter p signaling that, while mathematically sophisticated, these results are not mathematically rigorous.
Table 8.1. Best known rigorous bounds for the location of the satisfiability threshold for some
small values of k. The last row gives the largest density for which a polynomial-time algorithm
has been proven to find satisfying assignments.
k                         3      4      5      7      10      20
Best upper bound          4.51   10.23  21.33  87.88  708.94  726,817
Best lower bound          3.52   7.91   18.79  84.82  704.94  726,809
Algorithmic lower bound   3.52   5.54   9.63   33.23  172.65  95,263
The methods used to derive bounds for the location of the satisfiability threshold
also give bounds for the fraction of clauses that can be satisfied above it. Specifically,
let us say that a k-CNF formula with m clauses is p-satisfiable, for some
p ∈ [0, 1], if there exists an assignment satisfying at least

    (1 - (1 - p)·2^{-k}) m

clauses. Observe that every k-CNF formula is 0-satisfiable, since the average number of
satisfied clauses over all assignments is (1 - 2^{-k})m, while satisfiability is simply
1-satisfiability. Thus, given p ∈ (0, 1], the relevant quantity is the largest density,
r_k(p), for which a random k-CNF formula is w.h.p. p-satisfiable. Analogously
to the case p = 1, the naive union bound over {0,1}^n for the existence of a
p-satisfying truth assignment implies r_k(p) ≤ T_k(p), where

    T_k(p) = 2^k ln 2 / (p + (1 - p) ln(1 - p)) .
Note that since lim_{p→1} T_k(p) = 2^k ln 2, this recovers the naive satisfiability
upper bound. Using the second moment method, it was shown in [ANP07] that,
asymptotically, this upper bound is tight up to second order terms.

Theorem 4. There exists a sequence δ_k = O(k·2^{-k/2}) such that for all k ≥ 2
and p ∈ (0, 1],

    (1 - δ_k) T_k(p) ≤ r_k(p) ≤ T_k(p) .
Algorithms fall far short of this bound. Analogously to satisfiability, the
best known algorithm [CGHS04] finds p-satisfying assignments only for r =
O(T_k(p)/k).
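To get a feel for T_k(p), it is instructive to expand it for small p (a direct Taylor computation):

    p + (1 - p) ln(1 - p) = p²/2 + O(p³) ,  so  T_k(p) = (2^{k+1} ln 2 / p²)·(1 + O(p)) .

Thus, by Theorem 4, halving the required advantage p over a random assignment roughly quadruples the density at which a random formula remains w.h.p. p-satisfiable.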
In [MZK+ , MZK+ 99a, MZ98, MZK+ 99b], Monasson et al., using mathemati-
cally sophisticated but non-rigorous techniques of statistical physics, initiated the
analytical study of random CNF formulas that are mixtures of 2- and 3-clauses,
which they dubbed (2 + p)-CNF. Such formulas arise for a number of reasons. For
example, a frequent observation when converting problems from other domains
into satisfiability problems is that the result is a mixed CNF formula with a
substantial number of clauses of length 2, along with the clauses of length 3.
Another reason is that DPLL algorithms run by recursively solving satisfiability
on residual formulas, restricted versions of their input CNF formula, which are
mixtures of clauses of length at least 2. When given random 3-CNF formulas
as input, many DPLL algorithms produce residual formulas that are mixtures of
random 2- and 3-clauses, making properties of random (2 + p)-CNF crucial for
analyzing their running time.
A random (2 + p)-CNF on n variables with m clauses is formed by choosing
pm clauses of length 3 and (1 - p)m clauses of length 2, amongst all clauses of each
length, uniformly and independently. Thus, p = 0 corresponds to random 2-SAT,
while p = 1 corresponds to random 3-SAT. Below we will find it convenient to
sidestep this original formulation and state results directly in terms of the number
of 2- and 3-clauses, rather than the total number of clauses m and p.
The fact that r_2 = 1 implies that for any ε > 0 a random 2-CNF formula on n
variables with (1 - ε/2)n clauses is w.h.p. satisfiable, but adding εn 2-clauses to it
w.h.p. results in an unsatisfiable formula. The presults in [MZK+ 99b] suggested,
rather remarkably, that if instead of adding εn random 2-clauses one adds up to
0.703...n random 3-clauses, the formula remains satisfiable. In other words, that
there is no finite exchange rate between 2- and 3-clauses.
Inspired by these claims, Achlioptas et al. [AKKK01] proved that

Theorem 5. A random CNF formula with n variables, (1 - ε)n 2-clauses and λn
3-clauses is w.h.p. satisfiable for all ε > 0 and all λ ≤ 2/3, but w.h.p. unsatisfiable
for ε = 0.001 and λ = 2.28.

In other words, the physical prediction of an infinite exchange ratio is valid,
as one can add at least 0.66n 3-clauses (and no more than 2.28n). In [Ach99] it
was conjectured that the inequality λ ≤ 2/3 in Theorem 5 is tight. That is,

Conjecture 6. For all λ > 2/3, there exists ε = ε(λ) > 0 such that a random
CNF formula with n variables, (1 - ε)n 2-clauses and λn 3-clauses is w.h.p.
unsatisfiable.
Conjecture 6 is supported by a presult of Biroli, Monasson and Weigt [BMW00],
subsequent to [MZK+, MZK+ 99a], asserting that 2/3 is indeed tight. As we will
see, if true, it implies that the running time of a large class of DPLL algorithms
exhibits a sharp threshold behavior: for each algorithm A, there exists a critical
density r_A ≤ r_k such that A takes linear time for r < r_A, but exponential time
for r > r_A.
We saw that sparse random k-CNF formulas are hard to prove unsatisfiable us-
ing resolution [CS88]. Ben-Sasson and Impagliazzo [BSI99] and Alekhnovich and
Razborov [AR01] proved that the same is true for the polynomial calculus, while
Alekhnovich [Ale05] proved its hardness for k-DNF resolution. Random k-CNF
formulas are believed to be hard for other proof systems also, such as cutting
planes, and this potential hardness has been linked to hardness of approxima-
tion [Fei02]. Moreover, the hardness of proving their unsatisfiability has been
explored for dense formulas, where for sufficiently high densities it provably dis-
appears [BKPS02, FGK05, GL03, COGLS04, COGL07]. For a more general
discussion of the connections between k-CNF formulas and proof-complexity see
Chapter ??. Here we will focus on the resolution complexity of random k-CNF
formulas as it has immediate and strong implications for the most commonly used
satisfiability algorithm, namely the DPLL procedure.
Theorem 7. For all k ≥ 3 and any constant r > 0, w.h.p. res(F_k(n, rn)) = 2^{Ω(n)}.

Corollary 8. Every DPLL algorithm w.h.p. takes exponential time on F_k(n, rn),
for any constant r ≥ 2^k ln 2.
An analogous statement (Theorem 9) holds for random mixtures of 2- and 3-clauses:
for a random 3-CNF formula of any constant density, adding (1 - ε)n random
2-clauses w.h.p. has essentially no effect on its proof complexity.
Theorem 9 allows one to readily prove exponential lower bounds for the run-
ning times of DPLL algorithms for satisfiable random k-CNF formulas. This is
because many natural DPLL algorithms when applied to random k-CNF formulas
generate at least one unsatisfiable subproblem consisting of a random mixture of
2- and higher-length clauses, where the 2-clauses alone are satisfiable. (We will
discuss how, when and for which algorithms this happens in greater detail in Sec-
tion 8.9.) By Theorem 9, such a mixture has exponential resolution complexity
(converting k-clauses with k > 3 to 3-clauses arbitrarily can only reduce resolu-
tion complexity) and, as a result, to resolve any such subproblem (and backtrack)
any DPLL algorithm needs exponential time.
Random k-SAT, along with other random Constraint Satisfaction Problems such
as random graph coloring and random XOR-SAT, has also been studied systematically
in physics in the past two decades. For a general introduction and ex-
position to the physical methods see Part 1, Chapter 9. In particular, motivated
by ideas developed for the study of materials known as spin glasses, physicists
have put forward a wonderfully complex picture about how the geometry of the
set of satisfying assignments of a random k-CNF formula evolves as clauses are
added. Perhaps the most detailed and sophisticated version of this picture comes
from [KMRT+ 07].
Roughly speaking, statistical physicists have predicted (using non-rigorous
but mathematically sophisticated methods) that while for low densities the set of
satisfying assignments forms a single giant cluster, at some critical density this
cluster shatters into exponentially many clusters, each of which is relatively tiny
and far apart from all other clusters. These clusters are further predicted to be
separated from one another by huge energy barriers, i.e., every path connecting
satisfying assignments in different clusters must pass through assignments that
violate Ω(n) constraints. Moreover, inside each cluster the majority of variables
are predicted to be frozen, i.e., take the same value in all solutions in the cluster;
thus getting even a single frozen variable wrong requires traveling Hamming
distance Ω(n), and over a huge energy barrier, to correct it.
The regime of exponentially many tiny clusters, each one having constant
probability of vanishing every time a clause is added (due to its frozen variables),
is predicted in [KMRT+ 07] to persist until very close to the threshold, namely
for densities up to

    2^k ln 2 - (3 ln 2)/2 + o(1) ,    (8.1)

where the o(1) term is asymptotic in k. In comparison, the satisfiability threshold
itself is predicted to be at
    2^k ln 2 - (1 + ln 2)/2 + o(1) ,    (8.2)
i.e., fewer than 0.2n k-clauses later. For densities between (8.1) and (8.2), it is
predicted that nearly all satisfying assignments lie in a small (finite) number of
(atypically large) clusters, while exponentially many clusters still exist.
Perhaps the most remarkable aspect of the picture put forward by physicists is
that the predicted density for the shattering of the set of solutions into exponen-
tially many clusters scales, asymptotically in k, as
    (2^k / k) · ln k ,    (8.3)

fitting perfectly with the fact that all known algorithms fail at some density

    c_A · 2^k / k ,    (8.4)

where the constant c_A depends on the algorithm. We will refer to this phenomenon
as the algorithmic barrier. We note that so far there does not appear
to be a natural general upper bound on c_A for the types of algorithms that
have been analyzed, i.e., more sophisticated algorithms do achieve a greater
constant, but with rapidly diminishing returns in terms of their complexity.
By now a substantial part of the picture put forward by physicists has been made
rigorous, at least for k ≥ 8. Progress for smaller k stumbles on the fact that
the rigorous results rely on the second moment method, which does not seem to
perform as well for small k.
For the definitions in the rest of this section we assume we are dealing with
an arbitrary CNF formula F defined over variables X = x_1, ..., x_n, and we let
S(F) ⊆ {0,1}^n denote its set of satisfying assignments. We say that two assignments
are adjacent if their Hamming distance is 1, and we let H_F : {0,1}^n → N
be the function counting the number of clauses of F violated by each σ ∈ {0,1}^n.
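For very small formulas these objects can be computed outright. The sketch below (our names; brute-force enumeration, so tiny n only) lists S(F) for a clause list in the integer-literal encoding used in the earlier sketch, and splits it into connected components under the adjacency just defined, a toy version of the clusters discussed in this section.

import itertools
from collections import deque

def solutions(clauses, n):
    """All satisfying assignments, as 0/1 tuples."""
    return [a for a in itertools.product((0, 1), repeat=n)
            if all(any(a[abs(l) - 1] == (l > 0) for l in c) for c in clauses)]

def clusters(sols):
    """Connected components of S(F) under Hamming-distance-1 adjacency."""
    sol_set, seen, comps = set(sols), set(), []
    for s in sols:
        if s in seen:
            continue
        comp, queue = [], deque([s])
        seen.add(s)
        while queue:
            a = queue.popleft()
            comp.append(a)
            for i in range(len(a)):       # flip one coordinate at a time
                b = a[:i] + (1 - a[i],) + a[i + 1:]
                if b in sol_set and b not in seen:
                    seen.add(b)
                    queue.append(b)
        comps.append(comp)
    return comps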
One can get an easy result regarding the solution-space geometry of very
sparse random formulas by observing that if a formula F is satisfiable by the
pure literal rule alone, then the set S(F) is connected (recall that the pure literal
rule amounts to permanently satisfying any literal ℓ whose complement does not
appear in the formula and permanently removing all clauses containing ℓ; the
proof of connectivity is left as an exercise). The pure literal rule alone w.h.p.
finds satisfying assignments in random k-CNF formulas with up to δ_k·n clauses,
for some δ_k → 0, so this is very far away from the densities up to which the best
known algorithms succeed, namely O(2^k/k).
The following theorem of Achlioptas and Coja-Oghlan [ACO08], on the other
hand, asserts that for all k ≥ 8, precisely at the asymptotic density predicted
by physics in (8.3), the set of satisfying assignments shatters into an exponential
number of well-separated regions. Given two functions f(n), g(n), let us write
f ∼ g if lim_{n→∞} f(n)/g(n) = 1. Recall that the lower bound for the location
of the satisfiability threshold provided by the second moment method is

    s_k ≡ 2^k ln 2 - (k + 1)·(ln 2)/2 - 1 - o(1) ∼ 2^k ln 2 .
Theorem 11. For all k ≥ 8, there exists c_k < s_k such that for all r ∈ (c_k, s_k),
the set S(F_k(n, rn)) w.h.p. consists of exponentially many regions where:
1. Each region only contains an exponentially small fraction of S.
2. The Hamming distance between any two regions is Ω(n).
3. Every path between assignments in distinct regions has height Ω(n).
In particular, c_k ∼ (2^k/k) ln k .
The picture of Theorem 11 comes into even sharper focus for large k. In particular,
for sufficiently large k, sufficiently close to the threshold, the regions become
arbitrarily small and maximally far apart, while still exponentially many.
Theorem 12. For any 0 < ε < 1/3, fix r = (1 - ε) 2^k ln 2, and let

    α_k = 1/k ,    β_k = 1/2 - 5ε/6 ,    γ_k = 3k²·2^{-k/2} .

For all k ≥ k_0(ε), w.h.p. the set S = S(F_k(n, rn)) consists of 2^{γ_k n} regions where:
1. The diameter of each region is at most α_k n.
2. The distance between every pair of regions is at least β_k n.
Remark 13. The presults of [KMRT+ 07] state that the picture of Theorem 11
should hold for all k ≥ 4, but not for k = 3. In particular, for k = 3 the solution-space
geometry is predicted to pass from a single cluster to a condensed phase
directly.
It turns out that after the set of satisfying assignments has shattered into
exponentially many clusters, we can look inside these clusters and make some
statements about their shape.
Recall that the best upper bound for the satisfiability threshold scales as Θ(2^k),
whereas all efficient algorithms known work only for densities up to O(2^k/k). To
resolve whether this gap was due to a genuine lack of solutions, as opposed to
the difficulty of finding them, Achlioptas and Moore [AM06] introduced an approach
which allows one to sidestep the pitfall of computational complexity: namely,
using the second moment method one can prove that solutions exist in random
instances, without the need to identify any particular solution for each instance
(as algorithms do). Indeed, if random formulas are genuinely hard at some densities
below the satisfiability threshold, then focusing on the existence of solutions
rather than their efficient discovery is essential: one cannot expect algorithms to
provide accurate results on the threshold's location; they simply cannot get there!
Before we delve into the ideas underlying some of the results mentioned above
it is helpful to establish the probabilistic equivalence between a few different
models for generating random k-CNF formulas.
The second moment method approach [ANP05] rests on the following basic fact:
every non-negative random variable X satisfies Pr[X > 0] ≥ E[X]²/E[X²]. Given
any k-SAT instance F on n variables, let X = X(F) be its number of satisfying
assignments. By computing E[X²] and E[X]² for random formulas with a given
density r one can hope to get a lower bound on the probability that X > 0, i.e.,
that F_k(n, rn) is satisfiable. Unfortunately, this direct application fails dramatically,
as E[X²] is exponentially (in n) greater than E[X]² for every density r > 0.
Nevertheless, it is worth going through this computation below, as it points to
the source of the problem and also helps in establishing the existence of clusters.
We perform all computations below in the model where clauses are chosen with
replacement from C_k, the set of all 2^k · C(n, k) non-trivial k-clauses.
For a k-CNF formula with m clauses chosen independently with replacement,
it is straightforward to show that the number of satisfying assignments X satisfies

    E[X²] = 2^n Σ_{z=0}^{n} C(n, z) f_S(z/n)^m ,    (8.5)

where f_S(α) = 1 - 2^{1-k} + 2^{-k} α^k is the probability that two fixed truth assignments
that agree on z = αn variables both satisfy a randomly drawn clause. Thus,
(8.5) decomposes the second moment of X into the expected number of pairs of
satisfying assignments at each possible distance.
Observe that f_S is an increasing function of α and that f_S(1/2) = (1 - 2^{-k})²,
i.e., truth assignments at distance n/2 are uncorrelated. Using the approximation
C(n, αn) = (α^α (1-α)^{1-α})^{-n} · poly(n) and letting

    Λ_S(α) = 2 f_S(α)^r / (α^α (1-α)^{1-α}) ,

we see that

    E[X²] = (max_{0≤α≤1} Λ_S(α))^n · poly(n) ,
    E[X]² = Λ_S(1/2)^n .
Therefore, if there exists some α ≠ 1/2 such that Λ_S(α) > Λ_S(1/2), then the
second moment is exponentially greater than the square of the expectation and we
only get an exponentially small lower bound for Pr[X > 0]. Put differently, unless
the dominant contribution to E[X²] comes from uncorrelated pairs of satisfying
assignments, i.e., pairs with overlap n/2, the second moment method fails.
Unfortunately, this is precisely what happens for all r > 0: the entropic
factor E(α) = 1/(α^α (1-α)^{1-α}) in Λ_S is symmetric around α = 1/2, while f_S
is increasing in (0, 1), implying Λ'_S(1/2) > 0. That is, Λ_S is maximized at some
α > 1/2, where the correlation benefit balances the penalty of decreased entropy.
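This failure is easy to observe numerically. The sketch below (names are ours, plain floating point) evaluates Λ_S on a grid and reports its maximizer, which indeed sits strictly above 1/2.

def lam_S(alpha, k, r):
    """Lambda_S(alpha) = 2 f_S(alpha)^r / (alpha^alpha (1-alpha)^(1-alpha))."""
    f = 1 - 2 ** (1 - k) + 2 ** (-k) * alpha ** k
    entropy = alpha ** alpha * (1 - alpha) ** (1 - alpha)
    return 2 * f ** r / entropy

k, r = 3, 4.0
grid = [i / 10000 for i in range(1, 10000)]
best = max(grid, key=lambda a: lam_S(a, k, r))
print(best, lam_S(best, k, r), lam_S(0.5, k, r))

For k = 3 and r = 4.0 the maximizer is visibly larger than 1/2, so E[X²] exceeds E[X]² exponentially and the vanilla second moment method yields nothing.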
While the above calculation does not give us what we want, it is still useful.
For example, for any α ∈ [0, 1], it would be nice to know the number of
pairs of satisfying truth assignments that agree on z = αn variables in a random
formula. Each term in the sum in (8.5) gives us the expected number of such
pairs. While this expectation may overemphasize formulas with more satisfying
assignments (as they contribute more heavily to the expectation), it still gives
valuable information on the distribution of distances among truth assignments in
a random formula. For example, if for some values of z (and k, r) this expectation
tends to 0 with n, we can readily infer that w.h.p. there are no pairs of satisfying
truth assignments that agree on z variables in F_k(n, rn). This is because for any
integer-valued random variable Y, Pr[Y > 0] ≤ E[Y]. Indeed, this simple argument is
enough to establish the existence of clustering for all k ≥ 8, at densities in the
Θ(2^k) range (getting down to (2^k/k) ln k requires a lot more work).
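In the notation of the previous section, the z-th term of (8.5), i.e., the expected number of ordered pairs of satisfying assignments agreeing on z = αn variables, is

    2^n · C(n, αn) · f_S(α)^{rn} = Λ_S(α)^n · poly(n) ,

so if Λ_S(α) < 1 throughout some interval (a, b) ⊂ (1/2, 1), then w.h.p. no two satisfying assignments of F_k(n, rn) agree on between an and bn variables: the solution set splits into groups whose pairwise overlaps avoid the window (an, bn).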
To prove the existence of exponentially many regions one divides a lower bound for
the total number of satisfying assignments by an upper bound for the number
of truth assignments in each region. The lower bound comes from the expected
number of balanced satisfying assignments, since the success of the second moment
method for such assignments, which we will see immediately next, implies that
w.h.p. the actual number of balanced assignments is not much lower than its
expectation. For the upper bound, one bounds the total number of pairs of truth
assignments in each region as poly(n)·g(k, r)^n, where

    g(k, r) = max_{α ∈ [α_0, 1]} Λ_S(α) ,

for a suitable constant α_0 bounding from below the overlap of any two assignments
in the same region.
An attractive feature of the second moment method is that we are free to apply it
to any random variable X = X(F) such that X > 0 implies that F is satisfiable.
With this in mind, let us consider random variables of the form

    X = Σ_σ Π_c w(σ, c) ,

where the sum runs over all truth assignments, the product runs over all clauses,
and w(σ, c) ≥ 0 vanishes whenever σ falsifies c, so that X > 0 indeed implies
satisfiability. Arguing exactly as for (8.5),

    E[X²] = 2^n Σ_{z=0}^{n} C(n, z) f_w(z/n)^m ,

where f_w(z/n) = E[w(σ, c) · w(τ, c)] is the correlation, with respect to a single
random clause c, between two truth assignments σ and τ that agree on z variables.
It is also not hard to see that f_w(1/2) = E[w(σ, c)]², i.e., truth assignments at
distance n/2 are uncorrelated for any function w. Thus, arguing as in the previous
section, we see that E[X²] is exponentially greater than E[X]² unless f'_w(1/2) = 0.
At this point we observe that since we are interested in random formulas
where literals are drawn uniformly, it suffices to consider functions w such that:
for every truth assignment σ and every clause c = ℓ_1 ∨ ... ∨ ℓ_k, w(σ, c) = w(v),
where v_i = +1 if ℓ_i is satisfied under σ and v_i = -1 if ℓ_i is falsified under σ. (So, we
will require that w(-1, ..., -1) = 0.) Letting A = {-1, +1}^k and differentiating
f_w yields the geometric condition

    f'_w(1/2) = 0  ⟺  Σ_{v∈A} w(v) · v = 0 .    (8.6)
The condition on the r.h.s. of (8.6) asserts that the vectors in A, when scaled
by w(v), must cancel out. This gives us another perspective on the failure of
the vanilla second moment method: when w = w_S is the indicator variable for
satisfiability, the condition on the right hand side of (8.6) does not hold, since the
vector (-1, -1, ..., -1) has weight 0, while all other v ∈ A have weight 1.
Note that the r.h.s. of (8.6) implies that in a successful w each coordinate
must have mean 0, i.e., that each literal must be equally likely to be +1 or -1
when we pick truth assignments with probability proportional to their weight
under w. We will call truth assignments with km/2 ± O(√m) satisfied literal
occurrences balanced. As was shown in [AP04], if X is the number of balanced
satisfying assignments then E[X²] < C·E[X]², for all r ≤ s_k, where C = C(k) > 0
is independent of n. Thus, Pr[X > 0] ≥ 1/C and by Corollary 2 we get r_k ≥ s_k.
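To see where balance comes from, it is instructive to solve (8.6) within the natural product family of weights used in [AP04]: take w(v) = 0 for v = (-1, ..., -1) and w(v) = λ^{s(v)} otherwise, where s(v) is the number of +1 coordinates of v and λ ∈ (0, 1). By symmetry it suffices to make the first coordinate of the sum in (8.6) vanish:

    Σ_{v∈A} w(v) v_1 = λ(1 + λ)^{k-1} - [(1 + λ)^{k-1} - 1] = 0  ⟺  (1 - λ)(1 + λ)^{k-1} = 1 ,

which has a unique root λ = λ(k) ∈ (0, 1). Since λ < 1, these weights penalize assignments that satisfy many literal occurrences, which is exactly the tilt toward balance described above.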
To gain some additional intuition on the success of balanced assignments it
helps to think of Fk (n, m) as generated in two steps: first choose the km lit-
eral occurrences randomly, and then partition them randomly into k-clauses. At
the end of the first step, truth assignments that satisfy many literal occurrences
clearly have significantly greater conditional probability of eventually being sat-
isfying assignments. But such assignments are highly correlated with each other
since in order to satisfy many literal occurrences they tend to agree with the
majority truth assignment on more than half the variables. Focusing on balanced
assignments only, curbs the tendency of satisfying assignments to lean towards
the majority vote assignment.
Finally, we note that it is not hard to prove that for r ≥ s_k + O(1), w.h.p.
F_k(n, rn) has no satisfying truth assignments that satisfy only km/2 + o(n) literal
occurrences. Thus, any asymptotic improvement over the lower bound for r_k provided
by balanced assignments would mean that tendencies toward the majority
assignment become essential as we approach the threshold. By the same token,
we note that as k increases the influence exerted by the majority vote assign-
ment becomes less and less significant as most literals occur very close to their
expected kr/2 times. As a result, as k increases, typical satisfying assignments
get closer and closer to being balanced, meaning that the structure of the space
of solutions for small values of k (e.g., k = 3, 4) might be significantly different
from the structure for large values of k, something also predicted by the physical
methods.
8.7. Algorithms
For the purposes of this chapter we will categorize satisfiability algorithms into
two broad classes: DPLL algorithms and random walk algorithms. Within the for-
mer, we will distinguish between backtracking and non-backtracking algorithms.
More concretely, the DPLL procedure for satisfiability is as follows:

DPLL(F)
1. Repeatedly satisfy any pure literals and 1-clauses.
   If the resulting formula F' is empty, exit reporting satisfiable.
   If a contradiction (0-clause) is generated, exit.
2. Select a variable x in F' and a value v for x.
3. DPLL(F'_{x=v})
4. DPLL(F'_{x=1-v})
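For concreteness, here is a compact executable rendering of this procedure in Python. It is a sketch under our own conventions (clauses are lists of nonzero integers, with positive integers standing for variables and negative ones for their complements); the pure literal rule is omitted and Step 2 uses the fixed-order rule of ordered-dll described below, so this is not a tuned solver.

def set_literal(clauses, lit):
    """Satisfy lit: drop clauses containing lit, delete -lit elsewhere."""
    out = []
    for c in clauses:
        if lit in c:
            continue                      # clause satisfied, drop it
        reduced = [l for l in c if l != -lit]
        if not reduced:
            return None                   # 0-clause: contradiction
        out.append(reduced)
    return out

def unit_propagate(clauses):
    """Repeatedly satisfy 1-clauses; returns None on a contradiction."""
    while clauses is not None:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            return clauses
        clauses = set_literal(clauses, unit)
    return None

def ordered_dll(clauses, x=1):
    """Step 2 rule: first variable (in fixed order) still present, value 0 first."""
    clauses = unit_propagate(clauses)
    if clauses is None:
        return False                      # contradiction
    if not clauses:
        return True                       # empty formula: satisfiable
    while all(abs(l) != x for c in clauses for l in c):
        x += 1                            # skip variables no longer present
    for lit in (-x, x):                   # x = 0 first, then x = 1
        rest = set_literal(clauses, lit)
        if rest is not None and ordered_dll(rest, x + 1):
            return True
    return False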
Clearly, different rules for performing Step 2 give rise to different algorithms
and, in practice, the complexity of these rules can vary from minimal to huge.
Perhaps the simplest possible rule is to consider the variables in a fixed order, e.g.,
x_1, x_2, ..., x_n, always selecting the first variable in the order that is present in the
formula, and always setting x to the same value, e.g., x = 0. If one forgoes the
pure literal rule, which simplifies the mathematical analysis on random
formulas, the resulting algorithm is known as ordered-dll. If one also forgoes
backtracking, the resulting algorithm is known as uc, for Unit Clause propagation,
and was one of the first algorithms to be analyzed on random formulas.
The simple random walk algorithm of Papadimitriou [Pap91] (as long as violated
clauses exist, select one of their violated literals at random and flip it)
succeeds in linear time on random 3-CNF for densities as high as 1.63, i.e., up to
the largest density for which the pure literal rule succeeds (recall that the success
of the pure literal rule implies that the set of satisfying assignments is connected).
It is worth pointing out that, at this point, there exist both very interesting
presults regarding the performance of random walk algorithms on random for-
mulas [CMMS03, SM03] and also very interesting mathematical results [FMV06,
COMV07] regarding their performance on the planted model, i.e., on formulas
generated by selecting uniformly at random among all (2^k - 1)·C(n, k) k-clauses satisfied
by a fixed (planted) assignment. Perhaps our recent understanding of the
evolution of the solution-space geometry of random k-CNF formulas will help in
transferring some of these results.
The best algorithmic lower bound for k = 3, namely r_3 ≥ 3.52, was established
independently by Kaporis, Kirousis and Lalas [KKL06] and by Hajiaghayi and
Sorkin [HS03]. It seems likely that this bound can be further improved, at least
slightly, by making even more refined considerations in the choice of variable and
value, but it also seems clear that this is well within the regime of diminishing
returns.
For k > 3, the best results come from considering the following very natural
"shortest clause" algorithm sc: pick a random clause c of minimum length;
among the literals in c pick one at random and satisfy it (a sketch of sc
follows below). A weakening of this algorithm which, in the absence of 1-, 2-,
and 3-clauses, selects a random literal to satisfy was analyzed by Frieze and
Suen in [FS96] and gives the best known algorithmic lower bound for the k-SAT
threshold for k > 3, namely r_k ≥ γ_k·2^k/k, where γ_k → 1.817. In [FS96], the
authors give numerical evidence that even the full sc algorithm only succeeds up
to some c_k·2^k/k, where c_k = O(1).
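In the same conventions as before, a minimal sketch of the non-backtracking sc heuristic reads as follows (names are ours; note that unit clauses are handled automatically, since a 1-clause is always a shortest clause):

import random

def sc_heuristic(clauses):
    """Repeatedly satisfy a random literal from a random shortest clause;
    gives up (no backtracking) upon generating a 0-clause."""
    assignment = {}
    while clauses:
        shortest = min(len(c) for c in clauses)
        c = random.choice([d for d in clauses if len(d) == shortest])
        lit = random.choice(c)
        assignment[abs(lit)] = lit > 0
        new = []
        for d in clauses:
            if lit in d:
                continue
            d = [l for l in d if l != -lit]
            if not d:
                return None               # contradiction: failure
            new.append(d)
        clauses = new
    return assignment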
In a nutshell, the only algorithms that have been proven to find satisfying
assignments efficiently in random k-CNF formulas are extremely limited and only
succeed for densities up to O(2^k/k). In particular, both their variable ordering
and their value assignment heuristic can be implemented given very little, and
completely local, information about the variable-clause interactions. Of course,
this limitation is also what enables their analysis.
Survey Propagation (SP) can be thought of as computing marginals not with respect to
the uniform measure over satisfying assignments but, rather, (an approximation
of) the uniform measure over clusters, where each cluster is represented by the
fixed point of a certain iterative procedure applied to any assignment in the cluster.
That said, for all densities below the condensation transition, SP is not strictly
necessary: if SP can compute variable marginals, then so can a much simpler al-
gorithm called Belief Propagation, i.e., dynamic programming on trees. This
is because when the measure is carried by exponentially many well-scattered
clusters, marginals are expected to decorrelate. Indeed Gershenfeld and Monta-
nari [GM07] gave very strong rigorous evidence that BP succeeds in computing
marginals in the uncondensed regime for the coloring problem. So, although SP
might be useful when working very close to the threshold, it is not readily helpful
in designing an algorithm that can provably find solutions even at much lower
densities, e.g., say at r = 2k2 , roughly in the middle of the satisfiable regime.
One big obstacle is that, currently, to use either BP or SP to find satisfying
assignments one sets variables iteratively. When a constant fraction of the
variables are frozen in each cluster, as is the case after the dynamical transition,
setting a single variable typically eliminates a constant fraction of all clusters. As
a result, very quickly, one can be left with so few remaining clusters that decorrelation
ceases to hold. Concretely, in [MRTS07], Montanari, Ricci-Tersenghi and
Semerjian showed that (even with the relatively generous assumptions of statistical
physics computations) the following algorithm fails for densities greater than
(2^k/k) ln k. That is, step 2 below fails to converge after only a small fraction of
all variables have been assigned a value:
1. Select a variable v at random.
2. Compute the marginal distribution of v using Belief Propagation.
3. Set v to {0, 1} according to the computed marginal distribution; simplify
the formula; go to step 1.
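To make the loop concrete without committing to a Belief Propagation implementation, here is a toy rendering (names are ours) in which the marginal of step 2 is computed exactly by enumeration, feasible only for tiny n; in the algorithm analyzed in [MRTS07] this subroutine is BP itself.

import itertools, random

def marginal(clauses, free_vars):
    """Exact marginals over satisfying assignments; stands in for BP."""
    counts = {v: 0 for v in free_vars}
    total = 0
    for bits in itertools.product((0, 1), repeat=len(free_vars)):
        a = dict(zip(free_vars, bits))
        if all(any(a[abs(l)] == (l > 0) for l in c) for c in clauses):
            total += 1
            for v in free_vars:
                counts[v] += a[v]
    return {v: counts[v] / total for v in free_vars} if total else None

def bp_decimate(clauses, variables):
    variables, assignment = list(variables), {}
    while variables and clauses:
        v = random.choice(variables)                 # step 1
        m = marginal(clauses, variables)             # step 2 (BP in [MRTS07])
        if m is None:
            return None                              # no satisfying extension
        val = 1 if random.random() < m[v] else 0     # step 3
        assignment[v] = val
        variables.remove(v)
        lit = v if val else -v
        new = []
        for c in clauses:
            if lit in c:
                continue
            c = [l for l in c if l != -lit]
            if not c:
                return None
            new.append(c)
        clauses = new
    for v in variables:
        assignment[v] = 0                            # unconstrained leftovers
    return assignment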
After a DPLL algorithm backtracks for the first time, however, it becomes
hard to have a compact probabilistic model for the residual formula. As a result,
a probabilistic analysis akin to that possible for r < r_A appears beyond the
analysis using techniques of statistical physics). Nevertheless, for all backtracking
extensions of myopic algorithms it is possible to prove that they take exponential
time when the initial k-CNF formula is above a certain critical density. This is
because of the following immediate implication of Theorem 9.
Corollary 19. If a DPLL algorithm ever generates a residual formula that is an
unsatisfiable mixture of uniformly random clauses in which the 2-clause density
is below 1, then w.h.p. it will spend exponential time before backtracking from it.
That is, by Corollary 19, once a node in the backtracking search is reached
that corresponds to an unsatisfiable random mixture (but where the 2-clauses
alone are satisfiable), the search cannot leave the sub-tree for an exponentially
long time. Standard results (see e.g. [Ach01]) imply that w.u.p.p. this is precisely
what happens for uc started with 3.81n 3-clauses and for sc started with 3.98n 3-
clauses. This is because for such initial densities, at some point, the corresponding
algorithm w.u.p.p. generates a residual (2+p)-CNF formula which is unsatisfiable
w.h.p. per the results of Theorem 5.
Theorem 20. For any constant r ≥ 3.81, any backtracking extension of uc
w.u.p.p. takes time 2^{Ω(n)} on F_3(n, rn). Similarly for sc and r ≥ 3.98.
We note that the only reason Theorem 20 is not a high probability result
is that w.u.p.p. each algorithm might generate a contradiction and backtrack,
thus destroying the uniform randomness of the residual formula, before creating
a formula like the one mandated by Corollary 19. It is worth pointing out,
though, that whenever this occurs w.h.p. it is for trivial local reasons. In partic-
ular, Frieze and Suen in [FS96], introduced the following form of backtracking:
when a contradiction is reached, record the portion of the assignment between
the last free choice and the contradiction; these literals become hot. After flip-
ping the variable involved in the last free choice, instead of making the choice
that the original heuristic would suggest, give priority to the complements of the
hot literals in the order that they appeared; once the hot literals are exhausted
continue as with the original heuristic. This backtracking rule is quite natural in
that it revisits the last part of the partial assignment, i.e., the part that got us
into trouble in the first place. Moreover, it appears to be a genuinely good idea.
Experiments on random
formulas comparing this backtracking extension of uc with just reversing the last
free choice show that the histogram of run-times is significantly better for a large
range of densities [ABM04b]. Another property of this backtracking rule is that
as long as the value of each variable in a partial assignment has been flipped at
most once, as happens when dealing with trivial, local contradictions, the resid-
ual formula is uniformly random. In particular, for this backtracking extension,
Theorem 20 holds w.h.p.
Theorem 20 sheds light on a widely-cited observation of Mitchell, Selman and
Levesque [MSL92], based on experiments with ordered-dll on small problems,
stating that random 3-SAT is easy in the satisfiable region up to the 4.2 threshold,
becomes sharply much harder at the threshold and quickly becomes easy again at
larger densities in the unsatisfiable region. The upper end of this easy-hard-easy
characterization is somewhat misleading since, as we saw, the result of [CS88] in
fact asserts that w.h.p. random 3-CNF formulas only have exponential-size proofs
of unsatisfiability above the threshold. By now, the rate of decline in running time
as the density is increased has been analyzed as well by Beame et al. [BKPS02].
Theorem 20 shows that the lower end of this characterization is also misleading in
that the onset of exponential behavior occurs significantly below the (conjectured)
satisfiability threshold at density 4.2. This concurs with experimental evidence
that even the best of current DPLL implementations seem to have bad behavior
below the threshold [CDA+ 03]. Moreover, as we will see shortly, the gap between
the onset of exponential behavior and the satisfiability threshold increases rapidly
with k.
Corollary 19 probably points to a much larger truth than what is specifically
derived for the algorithms and backtracking schemes mentioned above. This is
because the proof of Theorem 9 is quite robust with respect to the probability
distribution of the clauses in the mixture. The essential ingredient seems to be
that the variable-clause incidence graph is an expander, suggesting that, in fact,
random k-CNF formulas are not the only formulas for which one could hope
to prove a result similar to Theorem 20. Moreover, precisely the combinatorial
richness of expanders suggests that restarting a DPLL algorithm on a random
k-CNF formula is unlikely to yield dramatically different results from run to run,
unless, of course, one is willing to restart an exponential number of times.
The bounds on the 3-clause density needed to cause exponential behavior
in satisfiability algorithms will be readily improved with any improvement on
the 2.28n upper bound for unsatisfiability in random (2 + p)-SAT. In particu-
lar, if Conjecture 6 is true, then Theorem 9 implies [ABM04b] that the running
time of every myopic algorithm goes from linear to exponential around a critical,
algorithm-specific density.
Theorem 21. For k = 4 and r ≥ 7.5, and for k ≥ 5 and r ≥ (11/k)·2^{k-2},
ordered-dll w.u.p.p. requires time 2^{Ω(n)} on F_k(n, rn).
References
[CR92] Vašek Chvátal and B. Reed. Mick gets some (the odds are on his side). In FOCS, pages 620-627, 1992.
[CS88] Vašek Chvátal and Endre Szemerédi. Many hard examples for resolution. J. Assoc. Comput. Mach., 35(4):759-768, 1988.
[DB97] Olivier Dubois and Yacine Boufkhad. A general upper bound for the satisfiability threshold of random r-SAT formulae. J. Algorithms, 24(2):395-420, 1997.
[DBM03] Olivier Dubois, Yacine Boufkhad, and Jacques Mandler. Typical random 3-SAT formulae and the satisfiability threshold. Electronic Colloquium on Computational Complexity (ECCC), 10(007), 2003.
[FdlV92] Wenceslas Fernandez de la Vega. On random 2-SAT. Unpublished manuscript, 1992.
[Fei02] Uriel Feige. Relations between average case complexity and approximation complexity. In STOC, pages 534-543, 2002.
[FGK05] Joel Friedman, Andreas Goerdt, and Michael Krivelevich. Recognizing more unsatisfiable random k-SAT instances efficiently. SIAM J. Comput., 35(2):408-430 (electronic), 2005.
[FMV06] Uriel Feige, Elchanan Mossel, and Dan Vilenchik. Complete convergence of message passing algorithms for some satisfiability problems. In APPROX-RANDOM, pages 339-350, 2006.
[FP83] John Franco and Marvin Paull. Probabilistic analysis of the Davis-Putnam procedure for solving the satisfiability problem. Discrete Appl. Math., 5(1):77-87, 1983.
[Fra01] John Franco. Results related to threshold phenomena research in satisfiability: lower bounds. Theoret. Comput. Sci., 265(1-2):147-157, 2001.
[Fri99] Ehud Friedgut. Sharp thresholds of graph properties, and the k-SAT problem. J. Amer. Math. Soc., 12:1017-1054, 1999.
[FS96] Alan Frieze and Stephen Suen. Analysis of two simple heuristics on a random instance of k-SAT. J. Algorithms, 20(2):312-355, 1996.
[GL03] Andreas Goerdt and André Lanka. Recognizing more random unsatisfiable 3-SAT instances efficiently. In Typical case complexity and phase transitions, volume 16 of Electron. Notes Discrete Math., 26 pp. (electronic). Elsevier, Amsterdam, 2003.
[GM07] Antoine Gerschenfeld and Andrea Montanari. Reconstruction for models on random graphs. In FOCS, pages 194-204, 2007.
[Goe96] Andreas Goerdt. A threshold for unsatisfiability. J. Comput. System Sci., 53(3):469-486, 1996.
[Gol79] Allen Goldberg. On the complexity of the satisfiability problem. In 4th Workshop on Automated Deduction (Austin, TX, 1979), pages 1-6, 1979.
[GSCK00] Carla P. Gomes, Bart Selman, Nuno Crato, and Henry Kautz. Heavy-tailed phenomena in satisfiability and constraint satisfaction problems. J. Automat. Reason., 24(1-2):67-100, 2000.
[HS03] Mohammad Taghi Hajiaghayi and Gregory B. Sorkin. The satisfiability threshold of random 3-SAT is at least 3.52. IBM Research Report RC22942, 2003.