A. The Lesson
W.T. Gowers
This course will cover a miscellaneous collection of topics in combinatorics and closely
related fields. What the topics have in common is that they all involve proofs that at
one time surprised experts by their simplicity. Sometimes these were the first proofs of
long-standing open problems, and sometimes they were new proofs of results that had
previously been established by much longer arguments. Several of these arguments use
ideas and techniques that have gone on to be used by many other people.
Another theme of the course is the sheer diversity of methods that are used in combi-
natorics. We shall see uses of probability, linear algebra, linear analysis, topology, entropy,
multivariate polynomials, tensor rank, concentration of measure, and more. (There will
also be one or two arguments that are completely elementary.)
Contents
1 Averaging arguments 2
1.1 Sperner’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 The Erdős-Ko-Rado theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 The Szemerédi-Trotter theorem . . . . . . . . . . . . . . . . . . . . . . . . 5
7 Entropy arguments 29
7.1 The Khinchin axioms for entropy and some simple consequences . . . . . . 29
7.2 The number of paths of length 3 in a bipartite graph . . . . . . . . . . . . 32
7.3 The formula for entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3.1 The axioms uniquely determine the formula. . . . . . . . . . . . . . 36
7.4 Brégman’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.5 Shearer’s entropy lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
10 Dimension arguments 55
10.1 Subsets of Rn that give rise to at most two distances . . . . . . . . . . . . 55
10.2 Set systems with intersections of restricted parity . . . . . . . . . . . . . . 56
10.3 More general restricted intersections . . . . . . . . . . . . . . . . . . . . . . 57
10.4 Kahn and Kalai’s disproof of Borsuk’s conjecture . . . . . . . . . . . . . . 59
1 Averaging arguments
Let a1 , . . . , an be real numbers with average a. Then there exists i such that ai ≥ a and
there exists i such that ai ≤ a. More generally, we have the following extremely basic fact.
Proposition 1.1. If X is a random variable, then P[X ≥ EX] > 0 and P[X ≤ EX] > 0.
Proof. It might seem as though this fact is too obvious even to need a proof. That is
almost true, and for the purposes of this course there is no problem if you just assume it,
but for completeness here is a proof.
If P[X ≥ EX] = 0, then X < EX almost surely, so for each n ≥ 1 we may let E_n be the
event that EX − 1/(n−1) < X ≤ EX − 1/n (interpreting 1/0 as ∞). These events are
disjoint and their union has probability 1, so
$$EX = \sum_n E[X 1_{E_n}] \le \sum_n (EX - n^{-1})\,P[E_n] = EX - \sum_n n^{-1}\,P[E_n],$$
which is strictly less than EX, since there must exist n with P[E_n] > 0. This is a
contradiction. Applying the same argument to −X shows that P[X ≤ EX] > 0.
Another way of stating the proposition is to say that EX ∈ [min X, max X]. Bizarrely,
this fact, which is particularly basic if the random variable X is discrete, the case that will
normally concern us, can quite often be used to prove highly non-trivial theorems. This
section will discuss three main examples: Sperner’s theorem, the Erdős-Ko-Rado theorem,
and the Szemerédi-Trotter theorem.
Before I get on to those, let me mention a closely related very basic principle, which
underlies so-called double-counting arguments. It can be formalized in various ways, but
here’s a simple one.
Proposition 1.2. Let G be a bipartite graph with finite vertex sets X and Y . Suppose
that the average degree of a vertex in X is δ|Y |. Then the average degree of a vertex in Y
is δ|X|.
Proof. The number of edges of G is δ|X||Y | and the result follows instantly.
This is “double counting” because we are counting the number of edges in two different
ways: as |X| times the average degree of a vertex in X, and as |Y | times the average degree
of a vertex in Y .
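To see the double count in action, here is a minimal Python sketch (added for illustration, not part of the notes; the graph is randomly generated) that checks the two ways of counting the edges of a bipartite graph against each other.

import random

# A random bipartite graph on vertex sets X and Y, stored as a set of edges.
X, Y = range(8), range(5)
edges = {(x, y) for x in X for y in Y if random.random() < 0.3}

deg_X = {x: sum(1 for (a, b) in edges if a == x) for x in X}
deg_Y = {y: sum(1 for (a, b) in edges if b == y) for y in Y}

# Counting edges from either side gives the same total, which is the
# identity behind Proposition 1.2:
# |X| * (average degree in X) = |E| = |Y| * (average degree in Y).
assert sum(deg_X.values()) == len(edges) == sum(deg_Y.values())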
The following proposition, which has essentially the same proof, perhaps captures better
how double-counting arguments often work.
Proposition 1.3. Let G be a bipartite graph with finite vertex sets X and Y . Suppose that
the average degree of a vertex in X is at least d1 and the average degree of a vertex in Y
is at most d2 . Then |Y | ≥ d1 |X|/d2 .
Proof. The number of edges of G is at least d1 |X| and at most d2 |Y |. The result follows.
Here are two simple examples of double-counting arguments.
Example 1.4. Let G be a planar graph with V vertices, E edges and F faces. Then each
face contains at least three edges and each edge is contained in at most two faces. It follows
from Proposition 1.3 that E ≥ 3F/2. (Let X be the set of faces of G and Y be the set of
edges of G, and join a face to an edge if and only if the face contains the edge. Then apply
the proposition.)
Before I give the next example, let me introduce some standard notation that will be
useful throughout the course. I shall write [n] for {1, 2, . . . , n}, and [n]^{(r)} for the set of
subsets of [n] of size r.
Example 1.5. Let 0 ≤ r < s ≤ n, let A be a subset of [n]^{(r)}, and let ∂_s A be the s-upper
shadow of A, which is defined to be {B ∈ [n]^{(s)} : ∃A ∈ A s.t. A ⊂ B}. Then every set in
A is contained in $\binom{n-r}{n-s}$ sets in ∂_s A, and every set in ∂_s A contains at most $\binom{s}{r}$ sets in A. It
follows that $|\partial_s A| \ge \binom{n-r}{n-s}|A|\big/\binom{s}{r}$, which can also be written as $\frac{(n-r)!\,r!}{(n-s)!\,s!}|A|$, and therefore
as $\binom{n}{s}|A|\big/\binom{n}{r}$.
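The inequality of Example 1.5 is easy to test by computer. Here is a short Python sketch (added here; the parameters and the choice of A are arbitrary):

from itertools import combinations
from math import comb

n, r, s = 7, 2, 4
A = [set(c) for c in combinations(range(n), r)][:10]   # some family of r-sets

# The s-upper shadow: all s-sets containing some member of A.
shadow = {frozenset(B) for B in combinations(range(n), s)
          if any(a <= set(B) for a in A)}

assert len(shadow) >= comb(n - r, n - s) * len(A) / comb(s, r)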
1.1 Sperner’s theorem
Let A be a family of subsets of [n]. How large can A be if no member of A is contained in
any other? A simple way to construct such a family is to ensure that all its members have
the same size. Sperner showed that that was the best one can do.
Theorem 1.6 (Sperner). Let A be a family of subsets of [n] such that no member of A is
contained in any other. Then $|A| \le \binom{n}{\lfloor n/2\rfloor}$.
Proof. For 0 ≤ i ≤ n let A_i be the set A ∩ [n]^{(i)} – that is, the collection of sets in A of size i.
Let x_1, . . . , x_n be a random ordering of [n] and for 0 ≤ i ≤ n let E_i be the set {x_1, . . . , x_i}.
Then the number of i such that E_i ∈ A is at most 1, since no member of A is contained
in any other. But for each i, E_i is uniformly distributed in [n]^{(i)}, so the expected number
of i such that E_i ∈ A is $\sum_{i=0}^n |A_i|/\binom{n}{i}$. (Here we used linearity of expectation, another
very simple principle that surprisingly often has non-trivial consequences.) It follows that
$\sum_{i=0}^n |A_i|/\binom{n}{i} \le 1$, and therefore, since the largest binomial coefficient is $\binom{n}{\lfloor n/2\rfloor}$,
that $|A| = \sum_i |A_i| \le \binom{n}{\lfloor n/2\rfloor}$, as claimed.
Note that the last step in the proof above threw away a lot of interesting information:
the above proof shows that, as one might expect, including lots of sets that are either very
small or very big makes it harder to avoid containments. The proof itself is not the original
proof of Sperner.
Another remark is that the proof used the contrapositive of Proposition 1.1.
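For very small n, Sperner's theorem can also be confirmed by exhaustive search. The following Python sketch (added as a sanity check; it is hopelessly slow beyond n = 4) encodes subsets of [n] as bitmasks:

from itertools import combinations
from math import comb

n = 4

def is_antichain(F):
    # a & b == a means that the set coded by a is contained in the one coded by b
    return all(a & b != a and a & b != b for a, b in combinations(F, 2))

best = 0
for code in range(1 << (1 << n)):          # every family of subsets of [n]
    F = [s for s in range(1 << n) if code >> s & 1]
    if is_antichain(F):
        best = max(best, len(F))

assert best == comb(n, n // 2)             # Sperner's bound: 6 when n = 4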
To see this, let us assume, without loss of generality, that I_1 = {x_1, . . . , x_k} belongs
to the family. Also, for each j let us define $I_j^-$ to be the interval {x_{j−k}, . . . , x_{j−1}}. Then
every interval that belongs to the family must intersect I_1, which means that it must be
either I_j for some j ∈ {1, 2, . . . , k} or $I_j^-$ for some j ∈ {2, 3, . . . , k}. (I did not mention
$I_{k+1}^-$ because it is equal to I_1, but this is not an important point.) Also, we cannot include
both I_j and $I_j^-$, since they are disjoint (because k ≤ n/2). It follows, as claimed, that at
most k intervals can belong to an intersecting family.
For each j, the probability that I_j belongs to the family is $|A|/\binom{n}{k}$, since I_j is uniformly
distributed. So the expected number of intervals that belong to the family is $n|A|/\binom{n}{k}$.
Therefore, $n|A|/\binom{n}{k} \le k$, and therefore $|A| \le \frac{k}{n}\binom{n}{k} = \binom{n-1}{k-1}$.
Note that if equality holds, then for every cyclic ordering there must be k consecutive
intervals in A. (Here we are using the fact that if P[X ≤ EX] = 1, then P[X = EX] = 1.)
But if exactly k intervals belong to an intersecting family, then we can say slightly more
than we said above. For 1 < j < k, it is not possible for $I_j^-$ and $I_{j+1}$ both to belong to A
(here we use the fact that n ≥ 2k + 1), but for each j with 1 < j ≤ k we have to choose either
$I_j^-$ or $I_j$ in order to have enough intervals. Therefore, if ever we choose $I_j^-$, then we must
choose $I_{j+1}^-$, which forces us to choose $I_{j+2}^-$, and so on all the way up to $I_k^-$. It follows that
there is some 1 ≤ r ≤ k such that the intervals we choose are $I_1, I_2, \dots, I_r, I_{r+1}^-, \dots, I_k^-$, or
in other words all intervals of length k that contain x_r.
This observation enables us to show uniqueness of the extremal example. Let x1 , . . . , x2k−1
be such that the intervals {xj , . . . , xj+k−1 } with 1 ≤ j ≤ k all belong to A, and let u be an
element not equal to any of x1 , . . . , x2k−1 (which exists since 2k ≤ n). Now let A be any set
of size k that contains xk , and enumerate its elements as {y1 , . . . , yk } in such a way that
y1 , . . . , yr belong to the set {x1 , . . . , xk }, yr = xk , and yr+1 , . . . , yk do not. Now construct
a cyclic order of [n] that begins with u, then continues with an ordering of x1 , . . . , xk that
finishes y1 , . . . , yr , and continues after that with yr+1 , . . . , yk (and then finishes in some
arbitrary way). Write this ordering as z0 , z1 , . . . , zn−1 with z0 = u. Then the interval
{z0 , z1 , . . . , zk−1 } is equal to the set {u, x1 , . . . , xk−1 }, which does not belong to A because
it does not intersect {xk , . . . , x2k−1 }. Since there must be k consecutive intervals in this
cyclic order that belong to A, and {z1 , . . . , zk } = {x1 , . . . , xk } does, it follows that A, which
equals {zk−r+1 , . . . , z2k−r }, belongs to A.
Therefore, A contains every set of size k that contains the element x_k. Since this is a
maximal intersecting family and has size $\binom{n-1}{k-1}$, we are done.
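Again one can confirm the bound exhaustively for tiny parameters. The following added Python sketch checks the Erdős-Ko-Rado bound for n = 6, k = 2 by trying every family of k-sets:

from itertools import combinations
from math import comb

n, k = 6, 2                                   # we need n >= 2k
ksets = [frozenset(c) for c in combinations(range(n), k)]

best = 0
for code in range(1 << len(ksets)):           # every family of k-sets
    F = [s for i, s in enumerate(ksets) if code >> i & 1]
    if all(a & b for a, b in combinations(F, 2)):    # pairwise intersecting
        best = max(best, len(F))

assert best == comb(n - 1, k - 1)             # the Erdős-Ko-Rado bound: 5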
Let G be a graph. The crossing number of G is defined to be the smallest number of crossings (that
is, pairs of edges that intersect) in any drawing of G in the plane. So a planar graph is
a graph that has crossing number 0, and the larger the crossing number, the further the
graph is from being planar.
Let us now investigate the connection between the crossing number of a graph and how
many edges it has.
Lemma 1.8. A planar graph G with n vertices has at most 3n − 6 edges.
Proof. Let e and f be the number of edges and faces of G, respectively. Euler's formula
tells us that n − e + f = 2. We also saw in Example 1.4 that e ≥ 3f /2. It follows that
2 = n − e + f ≤ n − e + 2e/3 = n − e/3. Rearranging gives the lemma.
Corollary 1.9. A graph with n vertices and m edges has crossing number at least m − 3n.
Proof. Let G be such a graph and suppose we have drawn it in the plane. By Lemma 1.8,
there is at least one crossing. Remove one of the two crossing edges and the number of
edges is still greater than 3n − 6, so there is another crossing. We can keep doing that
m − 3n times with the number of edges remaining greater than 3n − 6, so there must be
at least m − 3n crossings in the original drawing of G.
Now comes the averaging argument, which for large m will greatly boost the bound
just proved.
Corollary 1.10. Let G be a graph drawn in the plane with n vertices and m edges with
m ≥ 6n. Then there are at least m3 /72n2 crossings.
Theorem 1.11 (Szemerédi, Trotter). Given n distinct points and m distinct lines in the
plane, the number of incidences is at most $8\max\{m, n, m^{2/3}n^{2/3}\}$.
Proof. Regard the system of points and lines as a drawing of a graph as follows. The
points are the vertices. For each line L, enumerate the vertices along that line in order as
x1 , . . . , xk , and for each i < k regard the segment of L from xi to xi+1 as an edge.
Let the number of incidences be s. Let the lines be L1 , . . . , Lm and for each i let the
number of points on line Li be si . Then s1 + · · · + sm = s, and the number of edges of
the graph is (s1 − 1) + · · · + (sm − 1) = s − m. Therefore, the number of crossings is, by
Corollary 1.10, at least (s − m)3 /72n2 if s − m ≥ 6n.
However, since two lines cross in at most one point, we also know that the number
of crossings is at most $\binom{m}{2} \le m^2/2$. So if s ≥ 6n + m, then $(s - m)^3/72n^2 \le m^2/2$,
which implies that $(s - m)^3 \le 36m^2n^2$. From this it follows that either s ≤ 6n + 2m ≤
max{8m, 8n} or $(s/2)^3 \le 36m^2n^2$. In the second case, $s \le 8m^{2/3}n^{2/3}$. The result follows.
Three examples help to make sense of the bound in the theorem. If there is just one
line and it contains all the points, then the number of incidences is n. Similarly, if there is
just one point and it is contained in all the lines, then the number of incidences is m. And
if we take the grid {0, 1, . . . , r − 1} × {0, 1, . . . , 2r2 − 1} and all lines that pass through one
of the points (0, y) with y < r2 and have slope belonging to the set {1, 2, . . . , r}, then there
are 2r3 points and r3 lines, and each line passes through r points, which gives r4 incidences,
equalling the Szemerédi-Trotter bound up to a constant. With a bit more thought one can
generalize these examples and show that the bound is sharp up to a constant for all values
of m and n.
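The grid example can be checked mechanically. The following Python sketch (added; the parameter r is arbitrary) counts the incidences in the construction above and compares them with the quantity m^{2/3}n^{2/3} from the theorem:

r = 4
points = {(x, y) for x in range(r) for y in range(2 * r * r)}
lines = [(c, s) for c in range(r * r) for s in range(1, r + 1)]   # y = c + s*x

incidences = sum((x, c + s * x) in points
                 for (c, s) in lines for x in range(r))

n, m = len(points), len(lines)                 # n = 2r^3 points, m = r^3 lines
assert incidences == r ** 4                    # each line meets exactly r points
print(incidences, (m ** 2 * n ** 2) ** (1 / 3))   # r^4 versus about 1.59 r^4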
But the antiderivative of log x is x log x − x, so the expression in the middle is equal to
n log n − n + 1. Taking exp of both sides, we deduce that
$$(n-1)! \le e^{n\log n - n + 1} \le n!$$
for every n. Now $e^{n\log n - n + 1} = e(n/e)^n$, from which we obtain the lower bound
$n! \ge e(n/e)^n$. Also, $(n + 1)^n = (1 + 1/n)^n n^n \le e n^n$, so
$$n! \le e^{(n+1)\log(n+1) - (n+1) + 1} = e\Bigl(\frac{n+1}{e}\Bigr)^{n+1} \le e(n+1)\Bigl(\frac{n}{e}\Bigr)^n.$$
Proof. Let us begin by writing an exact formula for the binomial coefficient in question.
We have
$$2^{-n}\binom{n}{n/2} = \frac{n!}{2^n(n/2)!\,(n/2)!} = \frac{1\cdot2\cdot3\cdots n}{(2\cdot4\cdot6\cdots n)\,(2\cdot4\cdot6\cdots n)} = \frac{1\cdot3\cdot5\cdots(n-1)}{2\cdot4\cdot6\cdots n} = \prod_{m=1}^{n/2}\Bigl(1 - \frac{1}{2m}\Bigr).$$
Another bound that is often useful when m is a lot smaller than n is the trivial bound
$$\binom{n}{m} = \frac{n(n-1)\cdots(n-m+1)}{m!} \le n^m.$$
Sometimes one wants to be just a little more accurate than this, in which case we can use
the fact that m! ≥ (m/e)m to improve this upper bound to (en/m)m . Note that that is true
whether or not m is small, and indeed it can be useful even when m = αn for a constant
α, as long as that constant is not too close to 1/2 (because then it does not improve on
the trivial bound of 2n ).
In a later section we shall give a more accurate bound for $\binom{n}{m}$ when m = αn for a
constant α independent of n.
Next, it is surprisingly useful to think about ratios of consecutive binomial coefficients.
Directly from the formula one has that
$$\binom{n}{m+1} = \frac{n-m}{m+1}\binom{n}{m}.$$
In particular, if m = αn then the ratio is roughly (1 − α)/α. From this it follows if α < 1/2
that
$$\binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{m} \le \binom{n}{m}\cdot\frac{1}{1 - \alpha/(1-\alpha)} = \binom{n}{m}\cdot\frac{1-\alpha}{1-2\alpha}.$$
The important point here is not the precise estimate, but the fact that if α is bounded
away from 1/2, then to within a constant the sum of all the binomial coefficients up
to $\binom{n}{m}$ is the same as $\binom{n}{m}$ itself. Often we don't care about constant factors, so then we
can say that the sum of the binomial coefficients up to $\binom{n}{m}$ is basically $\binom{n}{m}$.
To give an idea of what one can do with ratios, we now prove a lemma that shows that
almost all of the discrete cube is concentrated around the middle layers. This is a simple
example of the concentration of measure phenomenon, which we shall see more about later
in the course.
Lemma 2.4. Let m = (1/2 − θ)n with 0 < θ ≤ 1/2. Then $2^{-n}\binom{n}{m} \le e^{-\theta^2 n/2}$.
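A quick numerical check of Lemma 2.4 (an added Python sketch, not part of the notes):

from math import comb, exp

n = 100
for m in range(n // 2):
    theta = 0.5 - m / n                        # m = (1/2 - theta) n
    assert comb(n, m) * 2.0 ** -n <= exp(-theta ** 2 * n / 2)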
Next, we prove a rather more general estimate. It can itself be considerably generalized,
but even in this form it is useful.
Lemma 2.6. Let X_1, . . . , X_n be independent random variables of mean zero taking values
in [−1, 1] and let X = Σ_i X_i. Then $P[X \ge \varepsilon n] \le e^{-\varepsilon^2 n/4}$.
Proof. We use a trick that can be used to prove many results of this flavour, which is to
look at the exponential moment Ee^{λX}, apply Markov's inequality, and optimize over λ. We
have
$$Ee^{\lambda X} = Ee^{\lambda\sum_i X_i} = E\prod_i e^{\lambda X_i} = \prod_i Ee^{\lambda X_i},$$
where the last equality uses independence. Now
$$Ee^{\lambda X_i} = 1 + \frac{\lambda^2}{2!}EX_i^2 + \frac{\lambda^3}{3!}EX_i^3 + \cdots.$$
If λ ≤ 1, then since |X_i| is always at most 1, the right-hand side is at most
$1 + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots \le e^{\lambda^2}$. So $Ee^{\lambda X} \le e^{\lambda^2 n}$.
By Markov's inequality,
$$P[X \ge \varepsilon n] = P[e^{\lambda X} \ge e^{\lambda\varepsilon n}] \le e^{\lambda^2 n - \lambda\varepsilon n}.$$
Choosing λ = ε/2 we obtain an upper bound of $e^{-\varepsilon^2 n/4}$.
Remark 2.7. The constant 1/4 in the bound above can be improved, but at the cost of
making the argument a bit more complicated.
Remark 2.8. Replacing each X_i by −X_i, we find that $P[X \le -\varepsilon n] \le e^{-\varepsilon^2 n/4}$ as well.
Corollary 2.9. Let ε > 0, let n be a positive integer, and let m = (1/2 − ε)n. Then
$$2^{-n}\sum_{k=0}^{m}\binom{n}{k} \le e^{-\varepsilon^2 n}.$$
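The following small simulation (an added Python sketch, with the X_i taken to be ±1 coin flips) illustrates how comfortably the bound of Lemma 2.6 holds:

import random
from math import exp

n, eps, trials = 200, 0.2, 5000
hits = 0
for _ in range(trials):
    X = sum(random.choice((-1, 1)) for _ in range(n))
    hits += X >= eps * n

print(hits / trials, "<=", exp(-eps ** 2 * n / 4))   # empirical frequency vs bound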
3 Well-separated sets and vectors
Variants of the following question come up frequently when one works on combinatorial
problems. Let A1 , . . . , AN be subsets of [n] of size n/2 and suppose that |Ai ∩ Aj | ≤ αn
for every i ≠ j. How large can N be?
The answer turns out to depend quite heavily on α, and in particular on how α compares
with 1/4, the significance of which is that the expected size of the intersection of two random
sets of size n/2 is n/4.
Theorem 3.1. Let α > 1/4. Then there can be exponentially many subsets of [n] of size
n/2 such that no two intersect in more than αn.
Proof. Let A be a random set of size n/2. We begin by estimating the probability that
|A ∩ {1, 2, . . . , n/2}| > αn. Writing m = αn, we can give a formula for this probability,
namely
$$\binom{n}{n/2}^{-1}\sum_{r=m+1}^{n/2}\binom{n/2}{r}\binom{n/2}{n/2-r} = \binom{n}{n/2}^{-1}\sum_{r=0}^{n/2-m-1}\binom{n/2}{r}^2,$$
where the equality follows by replacing r with n/2 − r and using the fact that $\binom{n}{k} = \binom{n}{n-k}$.
It follows that with non-zero probability the number of bad pairs is at most $e^{-cn}N^2/2$.
(Strictly speaking, we are using Proposition 1.1 here.) Therefore, there exist sets A_1, . . . , A_N
such that the number of bad pairs is at most $e^{-cn}N^2/2$. If we choose N such that this is
at most N/2, we can remove one set from each bad pair and still be left with N/2 sets,
and now there are no bad pairs. Taking $N = e^{cn}$ does the job.
can make is that if A and B are subsets of [n], then
$$|A\cap B| = \sum_{i=1}^n 1_A(i)1_B(i) = \langle 1_A, 1_B\rangle.$$
However, it is often convenient to associate with a set not its characteristic function but
its balanced function, which is defined as follows. If A is a subset of a finite set X and
|A| = δ|X|, then for each x ∈ X we define f_A(x) to be 1 − δ if x ∈ A and −δ if x ∉ A. We
call this function balanced because it averages zero: indeed E_x f_A(x) = (1 − δ)δ − δ(1 − δ) = 0.
Lemma 3.2. Let A and B be two subsets of [n] of size n/2. Let f_A and f_B be the balanced
functions of A and B. Then ⟨f_A, f_B⟩ = |A ∩ B| − n/4.
Proof. Here δ = 1/2, so f_A = 1_A − 1/2 and f_B = 1_B − 1/2. Therefore
$$\langle f_A, f_B\rangle = \sum_{i=1}^n \bigl(1_A(i) - \tfrac12\bigr)\bigl(1_B(i) - \tfrac12\bigr) = |A\cap B| - \frac{|A|}{2} - \frac{|B|}{2} + \frac{n}{4} = |A\cap B| - \frac{n}{4}.$$
In particular, if we have sets A and B of size n/2 and their intersection has size at most
(1/4 − δ)n, then ⟨f_A, f_B⟩ ≤ −δn. This enables us to prove the following rather surprising
result, which is that if α < 1/4, then the largest possible number of sets of size n/2 that
intersect in at most αn is bounded independently of n.
Theorem 3.3. Let A_1, . . . , A_m be subsets of [n] of size n/2 and suppose that |A_i ∩ A_j| ≤
(1/4 − δ)n for every i ≠ j. Then m ≤ 1 + 1/4δ.
Proof. Write f_i for the balanced function f_{A_i}. Then, writing ‖·‖ for the Euclidean norm
(that is, ‖f‖² = ⟨f, f⟩),
$$0 \le \Bigl\|\sum_{i=1}^m f_i\Bigr\|^2 = \sum_{i,j}\langle f_i, f_j\rangle = \sum_i \|f_i\|^2 + \sum_{i\ne j}\langle f_i, f_j\rangle \le \frac{mn}{4} - \delta n\,m(m-1),$$
where for the last inequality we used Lemma 3.2 (note that each ‖f_i‖² equals n/4). The
result follows on rearranging.
Of course, this is just (equivalent to) a special case of a more general theorem, which
can be proved in the same way.
Theorem 3.4. Let v_1, . . . , v_m be unit vectors in a Euclidean space, let δ > 0, and suppose
that ⟨v_i, v_j⟩ ≤ −δ for every i ≠ j. Then m ≤ 1 + 1/δ.
Proof.
$$0 \le \Bigl\|\sum_i v_i\Bigr\|^2 = m + \sum_{i\ne j}\langle v_i, v_j\rangle \le m - \delta m(m-1),$$
and rearranging gives m ≤ 1 + 1/δ.
$$\langle v_i, v_j\rangle = \frac{k-1}{(k+1)^2} - \frac{2}{k+1} + \frac{2}{(k+1)^2} = -\frac{1}{k+1}.$$
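The regular simplex behind this calculation is easy to produce explicitly. The following Python sketch (added; it assumes numpy is available) centres and normalizes the standard basis of R^{k+2} to obtain k + 2 unit vectors whose pairwise inner products are all −1/(k + 1), matching the bound m ≤ 1 + 1/δ of Theorem 3.4 with δ = 1/(k + 1):

import numpy as np

k = 5
m = k + 2                                       # m = 1 + 1/delta, delta = 1/(k+1)
E = np.eye(m)
V = E - E.mean(axis=0)                          # subtract the centroid
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit vectors in a (k+1)-dim space

G = V @ V.T                                     # Gram matrix of the v_i
off_diagonal = G[~np.eye(m, dtype=bool)]
assert np.allclose(off_diagonal, -1 / (k + 1))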
Theorem 3.5. Let x_1, . . . , x_m be non-zero vectors in R^n such that ⟨x_i, x_j⟩ ≤ 0 for every
i ≠ j. Then m ≤ 2n, and if m = 2n then there is an orthonormal basis a_1, . . . , a_n such
that each xi is a multiple of some aj (which implies that amongst the xi there is exactly
one positive multiple and one negative multiple of each aj ).
Proof. We prove this by induction on n. The case n = 1 is trivial. Assume now that n > 1
and that m ≥ 2n. Let a_1 = x_1/‖x_1‖ and let y_2, . . . , y_m be the orthogonal projections of
x_2, . . . , x_m to the subspace orthogonal to x_1. Then for each i ≥ 2, y_i = x_i − ⟨x_i, a_1⟩a_1. If
i ≠ j, then
$$\langle y_i, y_j\rangle = \langle x_i, x_j\rangle - \langle x_i, a_1\rangle\langle x_j, a_1\rangle.$$
Since neither of ⟨x_i, a_1⟩ and ⟨x_j, a_1⟩ is positive, this is at most ⟨x_i, x_j⟩, and therefore at
most 0.
We therefore have m − 1 vectors y_2, . . . , y_m in an (n−1)-dimensional space with ⟨y_i, y_j⟩ ≤
0 for every i ≠ j. Since m − 1 > 2(n − 1), it follows from our inductive hypothesis that
y2 , . . . , ym are not all non-zero. Without loss of generality y2 = 0, which implies that x2 is
a multiple of a1 , and therefore a negative multiple of a1 .
But if x1 is a positive multiple of a1 and x2 is a negative multiple of a1 , then the only
way another vector can avoid having a positive inner product with one of x1 and x2 is if it
is orthogonal to a1 . Therefore, x3 , . . . , xm are at least 2(n − 1) vectors living in the space
orthogonal to a1 , which is (n − 1)-dimensional. By induction we can find an orthonormal
basis a2 , . . . , an of this space with the properties stated in the theorem, and combining this
with a1 creates an orthonormal basis that proves the theorem.
Combining this result with Lemma 3.2 and observing that the balanced function of a
subset of [n] of size n/2 lives in the (n−1)-dimensional subspace orthogonal to (1, 1, . . . , 1),
we deduce immediately that it is not possible to find more than 2(n − 1) sets of size n/2
that intersect in at most n/4. But can we find that many? To do so, we need to find a large
set of orthogonal vectors with ±1 coordinates. (The balanced functions have coordinates
equal to ±1/2, but that is clearly equivalent.)
intersection of size n/4. (The fact that the sets have size n/2 is equivalent to the
fact that the later rows of H_n are orthogonal to the first row.)
The second construction is algebraic. Recall that if 0 ≤ x < p, then the Legendre symbol
$\left(\frac{x}{p}\right)$ stands for 1 if x is a quadratic residue, −1 if x is a non-residue, and 0 if x = 0. Let p
be a prime of the form 4m + 3, let n = p + 1 and define the Paley matrix P = P_n by setting
$P_{xy} = \left(\frac{x-y}{p}\right)$ if 0 ≤ x, y ≤ p − 1 and x ≠ y, −1 if x = y, and 1 if either of x, y is equal to
p. In other words, we first create a p × p matrix by setting P_{xy} to be 1 or −1 according to
whether x − y is or is not a quadratic residue, considering 0 to be a non-residue for this
purpose, and then we attach to it an extra row and column that consist of 1s.
Here is the Paley matrix P8 . Note that if you ignore the last row and column, then it
is constant down diagonals (or even “mod-7 diagonals”).
−1 1 1 −1 1 −1 −1 1
−1 −1 1 1 −1 1 −1 1
−1 −1 −1 1 1 −1 1 1
1 −1 −1 −1 1 1 −1 1
−1 1 −1 −1 −1 1 1 1
1 −1 1 −1 −1 −1 1 1
1 1 −1 1 −1 −1 −1 1
1 1 1 1 1 1 1 1
The proof that the Paley matrix is orthogonal uses some elementary number theory in a
very pleasant way.
Lemma 3.6. For every prime p and every d ≠ 0 mod p we have that $\sum_x \left(\frac{x}{p}\right)\left(\frac{x+d}{p}\right) = -1$,
where the summation is over all x mod p.
Proof. We shall make use of the fact that the function $x \mapsto \left(\frac{x}{p}\right)$ is multiplicative:
$$\sum_x \left(\frac{x}{p}\right)\left(\frac{x+d}{p}\right) = \sum_{x\ne 0,-d} \left(\frac{x^2}{p}\right)\left(\frac{1+dx^{-1}}{p}\right) = \sum_{x\ne 0,-d} \left(\frac{1+dx^{-1}}{p}\right).$$
Now as x ranges over all values other than 0 or −d, 1 + dx^{-1} ranges over all values other than
1 or 0. But half the non-zero elements mod p are quadratic residues and 1 is a quadratic
residue, so this last sum is equal to −1.
Proof. The inner product of two distinct rows of the Paley matrix is closely related to the
sum calculated in Lemma 3.6. If neither row is the last, then it is of the form
$$\sum_{x\ne 0,-d}\left(\frac{x}{p}\right)\left(\frac{x+d}{p}\right) - \left(\frac{d}{p}\right) - \left(\frac{-d}{p}\right) + 1,$$
since when x = 0 we replace $\left(\frac{x}{p}\right)$ by −1, and the product of the last entries in the two rows
is 1. But p ≡ 3 mod 4, so −1 is not a quadratic residue mod p, from which it follows that
$\left(\frac{d}{p}\right) = -\left(\frac{-d}{p}\right)$. Therefore, by Lemma 3.6 (and the fact that $\left(\frac{0}{p}\right) = 0$), the inner product
is −1 + 1 = 0.
If one of the rows is the last row, then we obtain the sum
$$-1 + \sum_{x\ne 0}\left(\frac{x}{p}\right) + 1.$$
Since exactly half the non-zero numbers mod p are quadratic residues, this is also 0.
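The whole construction can be verified by computer. Here is an added Python sketch (it assumes numpy) that builds the matrix P_{p+1} exactly as defined above and checks that its rows are orthogonal:

import numpy as np

p = 7                                           # a prime with p % 4 == 3
QR = {(x * x) % p for x in range(1, p)}         # quadratic residues mod p

def legendre(a):
    a %= p
    return 0 if a == 0 else (1 if a in QR else -1)

n = p + 1
P = np.ones((n, n), dtype=int)                  # last row and column all 1s
for x in range(p):
    for y in range(p):
        P[x, y] = -1 if x == y else legendre(x - y)

assert np.array_equal(P @ P.T, n * np.eye(n, dtype=int))   # rows orthogonal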
A ±1 matrix with orthogonal rows is called a Hadamard matrix. The construction of
Walsh matrices relies on the observation that if H is a Hadamard matrix, then so is
$\begin{pmatrix} H & H\\ H & -H\end{pmatrix}$. From that and the construction of Paley matrices, we find that if n is of the
form 2^m(p + 1) for p a prime congruent to 3 mod 4, then there is an n × n Hadamard
matrix. The following question is a famous open problem: does there exist an n × n
Hadamard matrix for every n that is a multiple of 4?
The smallest n that is a multiple of 4 for which no Hadamard matrix is known is 668.
Theorem 4.1. Let A be a set of size n. Then |A + A| ≥ 2n − 1, with equality if and only
if A is an arithmetic progression.
Proof. Write the elements of A in increasing order as a_1 < a_2 < · · · < a_n. Then the sums
a_1 + a_1, a_1 + a_2, . . . , a_1 + a_n, a_2 + a_n, . . . , a_n + a_n
form a strictly increasing sequence, so they are all distinct. More generally, given any
sequence
(i1 , j1 ), (i2 , j2 ), . . . , (i2n−1 , j2n−1 )
such that i1 = j1 = 1, i2n−1 = j2n−1 = n, and each term in the sequence is obtained by
adding 1 to either the x-coordinate or the y-coordinate of the previous term, then all the
resulting sums ait + ajt form a strictly increasing sequence.
It follows that if |A + A| = 2n − 1, then these sequences are all the same. From that it
follows that ai + aj depends only on i + j. From this it follows that the numbers a1 , . . . , an
form an arithmetic progression, since ai−1 + ai+1 = 2ai for every i strictly between 1 and
n.
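Both halves of Theorem 4.1 are easy to observe experimentally; here is an added two-part Python check:

import random

def sumset(A):
    return {a + b for a in A for b in A}

n = 10
AP = {3 + 5 * i for i in range(n)}              # an arithmetic progression
assert len(sumset(AP)) == 2 * n - 1             # equality for progressions

B = set(random.sample(range(1000), n))          # a typical set
assert len(sumset(B)) >= 2 * n - 1              # the lower bound always holds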
We also define the product set A.A to be the set {xy : x, y ∈ A}. If A consists of positive
numbers, then by taking the logarithm of those numbers and applying what we have just
proved for sumsets (after checking that the proof goes through unchanged if a1 , . . . , an are
real numbers, which it does), we see that if |A| = n, then |A.A| ≥ 2n − 1, with equality if
and only if A is a geometric progression.
So far, so simple, but we are a short step away from a famous unsolved problem of
Erdős and Szemerédi.
Conjecture 4.2. For every c > 0 there exists n0 such that for every n ≥ n0 and every set
A of positive integers of size n, either A + A or A.A has size at least n2−c .
Note that the conjecture is saying that either the sumset or the product set is almost
as big as it can be, since trivially the maximum size is at most n(n + 1)/2 (and simple
examples show that this maximum can be attained – for example to obtain a large sumset,
take any sequence a1 , a2 , . . . , an of positive integers such that ai ≥ 2ai−1 for every i).
One reason the conjecture is quite plausible is that sets with small sumsets tend to be
related to arithmetic progressions, which have very large product sets, whereas sets with
small product sets tend to be related to geometric progressions, which have very large
sumsets. However, this does not rule out some kind of strange “intermediate” set with,
say, a sumset and a product set of size at most n1.99 , and indeed, almost nothing is known
about the structure of sets with sumsets that are of this sort of size.
For some years, the best known bound for the problem, due to József Solymosi, was
that, up to logarithmic factors, either the sumset or the product set was of size at least
n4/3 . The exponent 4/3 has subsequently been improved a few times, each time by just
a tiny amount, and the current record, due to Misha Rudnev and Sophie Stevens, stands
at 4/3+2/1167. But these improved bounds require complicated arguments, whereas the
proof of Solymosi, which we give now, is beautifully simple.
We start with a useful definition and an almost trivial, but for the proof very important,
lemma.
Definition 4.3. Let A be a finite subset of R⁺. For each x ∈ R⁺, write $\rho_A^+(x)$ for the
number of pairs (a, b) ∈ A² such that a + b = x, $\rho_A^\times(x)$ for the number of pairs (a, b) ∈ A²
such that ab = x, and $\rho_A^\div(x)$ for the number of pairs (a, b) ∈ A² such that a/b = x.
Lemma 4.4. Let A be a finite set of positive real numbers. Then
$$\sum_{x\in A.A}\rho_A^\times(x)^2 = \sum_{x\in A/A}\rho_A^\div(x)^2.$$
Proof. The multiplicative energy $\sum_{x\in A.A}\rho_A^\times(x)^2$ is equal to the number of quadruples
(a, b, c, d) ∈ A⁴ such that ab = cd, which is the number of quadruples such that a/c = d/b,
which is equal to $\sum_{x\in A/A}\rho_A^\div(x)^2$.
Solymosi deduces his sum-product theorem from a more precise statement, which is
that for every set A of positive real numbers, $|A.A||A + A|^2 \ge |A|^4/4\lceil\log|A|\rceil$. This he
proves by obtaining both an upper bound and a lower bound for the multiplicative energy
of A. The lower bound is easier, so we give it first. We shall use the following inequality,
which is immensely useful throughout combinatorics. Because it is so important, we give
two proofs.
Lemma 4.5. Let a_1, . . . , a_n be real numbers. Then $\sum_i |a_i|^2 \ge n^{-1}\bigl(\sum_i |a_i|\bigr)^2$.
Proof. For any random variable X we have E(X − EX)² = EX² − (EX)².
The left-hand side is clearly non-negative, so (EX)² ≤ EX². Now apply this fact to the
random variable that takes the value |a_i| with probability n^{-1} to deduce that
$n^{-2}\bigl(\sum_i |a_i|\bigr)^2 \le n^{-1}\sum_i |a_i|^2$.
Lemma 4.6. Let A be a set of positive real numbers. Then the multiplicative energy of A
is at least |A|4 /|A.A|.
Proof. We begin by applying another trick that is extremely useful in a number of contexts.
It applies when we have a function defined on a finite set and have difficulty handling it
because it is not sufficiently close to being constant. In such a situation, if the values are
not too spread out and we are prepared to sacrifice a logarithmic factor, we can simply
partition the domain according to the values taken up to the nearest power of 2. This
process is called dyadic decomposition.
Here every value of $\rho_A^\div(m^{-1})$ is between 1 and |A|, so if we divide the set {1, 2, . . . , |A|}
up into intervals of the form {x : 2^{k−1} ≤ x < 2^k}, we need at most ⌈log₂|A|⌉ of those
intervals. It follows by averaging that there exists one of the intervals, call it I, such that
$$\sum_{\rho_A^\div(m^{-1})\in I}\rho_A^\div(m^{-1})^2 \ge \frac{1}{\lceil\log_2|A|\rceil}\sum_{m\in A/A}\rho_A^\div(m^{-1})^2.$$
We shall now concentrate our efforts on finding an upper bound for the left-hand side.
Let us think a bit about what $\rho_A^\div(m^{-1})^2$ means. The number of (a, b) ∈ A × A such
that a/b = m^{-1} is the number of points in the intersection of A × A with the line y = mx.
(Note that I am writing A × A for the Cartesian product of A with itself – it does not
stand for the product set.) Therefore, the square of this number is the number of pairs
of such points. That is not very interesting as it stands, but now that we know that the
lines intersect A × A in sets of roughly the same size, we can make it more interesting by
observing that up to a factor of 2 it is equal to the number of pairs of points that can
be obtained by taking one point from one line and one point from another. And if the
points in one line are p1 , . . . , pr and the points in another line are q1 , . . . , qs , then because
the lines have different gradients, the points pi + qj are distinct and they all belong to
(A × A) + (A × A) = (A + A) × (A + A). (Note that this × is a Cartesian product: A × A
is not to be confused with A.A).
Let m_1, . . . , m_t be the elements of A/A, in increasing order, for which $\rho_A^\div(m^{-1}) \in I$,
and for each i let L_i be the line y = m_i x and let B_i = L_i ∩ (A × A). Then the sets B_i + B_{i+1}
are disjoint (because B_i + B_{i+1} lies between L_i and L_{i+1} and the gradients of the L_i are in
increasing order) and contained in (A + A) × (A + A). Also, for each i, |B_i + B_{i+1}| ≥ |B_i|²/2,
since |B_i + B_{i+1}| = |B_i||B_{i+1}| and all the B_i have the same size up to a factor of 2.
It follows that $\sum_{i=1}^{t-1}|B_i|^2 \le 2|A + A|^2$. This is not quite what we want, because we
would prefer to include |B_t|² on the left-hand side. There is a small trick that deals with
this problem, which is to consider the vertical line that passes through the leftmost point
of B_t, which is a point (a, m_t a) for some a ∈ A. The intersection of this line with A × A
contains all points (a, m_t b) such that (b, m_t b) ∈ B_t. Setting B_{t+1} to be this set of points,
we then have that |B_{t+1}| = |B_t|, that |B_{t+1} + B_t| = |B_t|², and that B_{t+1} + B_t is disjoint
from the other sets B_i + B_{i+1}. We may therefore conclude that
$$\sum_{\rho_A^\div(m^{-1})\in I}\rho_A^\div(m^{-1})^2 = \sum_{i=1}^{t}|B_i|^2 \le 2|A + A|^2,$$
We have more or less finished the argument. What it proves is that for every finite set A
of positive real numbers,
$$|A.A||A + A|^2 \ge \frac{|A|^4}{2\lceil\log|A|\rceil}.$$
Proof. By our upper and lower bounds for the multiplicative energy, we find that
$$2|A + A|^2\lceil\log|A|\rceil \ge \frac{|A|^4}{|A.A|}.$$
5.1 The Borsuk-Ulam theorem and some variants
The Borsuk-Ulam theorem is the following statement.
Theorem 5.1. Let f : S n → Rn be a continuous function. Then there exists x ∈ S n such
that f (x) = f (−x).
It is equivalent to the following result, which is itself sometimes referred to as the Borsuk-
Ulam theorem.
Theorem 5.2. Let A1 , . . . , An+1 be closed subsets of S n with union equal to S n . Then
there exist x ∈ S n and i ∈ {1, 2, . . . , n + 1} such that x ∈ Ai and −x ∈ Ai .
Proof of equivalence. Assume Theorem 5.1, and let A_1, . . . , A_{n+1} be closed subsets of S^n
with union S^n. Define a function f : S^n → R^{n+1} by f(x) = (d(x, A_1), . . . , d(x, A_{n+1})). Note
that the image of f lies in the subset of R^{n+1} that consists of all vectors with non-negative
coordinates such that at least one coordinate is zero, the second property following from
the fact that the sets A_i cover S^n. This set is homeomorphic to R^n, as can be seen by
taking the orthogonal projection from it to the subspace $\{y : \sum_i y_i = 0\}$.
By Theorem 5.1, there therefore exists x such that f (x) = f (−x). But since the union
of the Ai is S n , there exists i such that x ∈ Ai , and therefore d(x, Ai ) = 0, and therefore
d(−x, Ai ) = 0. Since Ai is closed, this implies that −x ∈ Ai .
Conversely, suppose we can find a continuous function f : S n → Rn for which f (x) is
never equal to f (−x). Define a continuous function g : S n → S n−1 by
$$g(x) = \frac{f(x) - f(-x)}{\|f(x) - f(-x)\|_2}$$
and note that g(−x) = −g(x) for every x. Now there exist n + 1 closed sets B_1, . . . , B_{n+1} that
cover S n−1 without any of them containing a pair of antipodal points. For example, take
a regular n-dimensional simplex and project its faces on to the sphere in the obvious way.
Since each face is part of an affine hyperplane that does not contain 0, its projection lies
strictly inside a hemisphere and therefore does not contain a pair of antipodal points. But
then the n + 1 sets g −1 (Bi ) are closed and cover S n , and do not contain a pair of antipodal
points (because g(−x) = −g(x)), contradicting Theorem 5.2.
This in turn implies (and as we shall soon see, is equivalent to) the same statement but
about open sets.
Theorem 5.3 (Open-sets version). Let A1 , . . . , An+1 be open subsets of S n with union
equal to S n . Then there exist x ∈ S n and i ∈ {1, 2, . . . , n + 1} such that x ∈ Ai and
−x ∈ Ai .
Theorem 5.2 implies Theorem 5.3. For each i let f_i(x) = d(x, A_i^c) and let f(x) = max_i f_i(x).
Then each f_i is continuous, and therefore so is f. Also, since every x belongs to some A_i
and every A_i is open, f(x) > 0 for every x. By compactness, it follows that there is some
ε > 0 such that f(x) ≥ ε for every x.
Now for each i define B_i to be {x : d(x, A_i^c) ≥ ε}. Then each B_i is a closed subset of A_i
and the union of the Bi is S n . Therefore, by Theorem 5.2 we can find i and x such that
x ∈ Bi and −x ∈ Bi , which proves the result.
We now show that the open-sets version implies the following “mixed” version.
Theorem 5.4 (Open or closed version). Let A1 , . . . , An+1 be subsets of S n , each of which
is open or closed, with union equal to S n . Then there exist x ∈ S n and i ∈ {1, 2, . . . , n + 1}
such that x ∈ Ai and −x ∈ Ai .
Theorem 5.3 implies Theorem 5.4. We prove this by induction. Suppose that A1 , . . . , At
are closed and At+1 , . . . , An+1 are open. If t = 0 then the result is Theorem 5.3 so we
are done by hypothesis. Otherwise, suppose that At does not contain a pair of antipodal
points. Define a function f by f(x) = max{d(x, A_t), d(−x, A_t)}. Then f is continuous
and never takes the value zero, so is bounded below by some ε > 0. Write (A_t)_ε for the
set {x ∈ S^n : d(x, A_t) < ε}. Then (A_t)_ε is open, contains A_t and does not contain a
pair of antipodal points. But then by our inductive hypothesis at least one of the sets
A1 , . . . , At−1 , At+1 , . . . , An+1 contains a pair of antipodal points and we are done.
Note that this last result includes the case where all the sets are closed, so all the results
we have proved so far are equivalent. It may look as though the last theorem is of interest
principally as an inductive step useful for getting from the all-open version to the all-closed
version, but actually we shall be using one of the intermediate cases.
To get a more intuitive feel for why the mixed result is true, it is helpful to consider
the case d = 1 and assume that the sets C1 and C2 are both intervals, or rather circular
arcs. If they could be half-open intervals, then it would be easy to ensure that neither
contained a pair of antipodal points – for example, if we identify the unit circle with the
interval [0, 2π) in the usual way, then C1 could be the interval [0, π) and C2 the interval
[π, 2π). But if one of the intervals is required to contain both end points, then we can no
longer do this, or anything like it.
will necessarily be a pair of antipodal points that have the same colour (or more precisely
have a colour in common).
By slightly modifying this statement, we can remove any topological requirements on
the colours.
Theorem 5.5. Let δ > 0 and define a graph on S^d by joining vectors u and v if ⟨u, v⟩ <
−1 + δ. Then this graph has chromatic number at least d + 2.
Proof. If u and v are unit vectors, then ‖u + v‖² = 2 + 2⟨u, v⟩. It follows that ⟨u, v⟩ < −1 + δ
if and only if ‖u + v‖² < 2δ (which is saying that the distance from u to −v is less than
√(2δ)).
Now let C_1, . . . , C_{d+1} be sets that partition S^d, let ε = √(δ/2), and for each i define
U_i to be the set of all points x such that d(x, C_i) < ε. Then U_i is open for each i, so by
Theorem 5.3 there exists i such that U_i contains a pair x, −x of antipodal points. By the
definition of U_i, C_i contains points u, v with ‖u − x‖ < ε and ‖v + x‖ < ε, which implies
by the triangle inequality that ‖u + v‖ < 2ε = √(2δ), which as we saw above implies that u
and v are joined by an edge. Therefore, there is no proper colouring with fewer than d + 2
colours.
Remark 5.6. If δ is small enough, then the bound above is sharp. We have basically seen
a construction that demonstrates this in the previous subsection: we can take d + 2 unit
vectors u_1, . . . , u_{d+2} in R^{d+1} that form the vertices of a regular simplex, and let C_i be the
set of x such that ⟨x, u_i⟩ > θ. Provided θ is at most about d^{-1}, these sets cover all of S^d,
and the largest distance between two vectors in Ci is slightly less than 2. I’ll leave it as an
exercise to work out the details more precisely.
Remark 5.7. If δ is small, then two points that are joined to each other must be almost
antipodal. This clearly implies that any odd cycle in the graph must be quite long. (I’ll
leave it as another exercise to prove a precise statement along these lines.) In particu-
lar, this construction gives us a big supply of graphs with no short odd cycles and high
chromatic number.
the time, came up with the following even shorter proof. He began by observing Theorem
5.4, the “mixed” version of the Borsuk-Ulam theorem we proved in the previous section.
And then came the crazily short proof of Lovász’s theorem.
Proof. We are free to choose any set of size 2n + k as the ground set, so let us choose 2n + k
points in general position in the sphere S k , which we regard as a subset of Rk+1 , where
the key property we require is that no k + 2 points are contained in any proper subspace
of Rk+1 . Let E be the set of points chosen.
Let C1 , . . . , Ck+1 be sets that partition E (n) , and let φ : [2n + k] → S k be a map whose
image is in general position. In particular, the property we need is that no k + 2 points of
the image should lie in a proper subspace of R^{k+1}.
For each i, let X_i ⊂ S^k be the set of all x such that the open hemisphere H(x) = {y :
⟨x, y⟩ > 0} contains a set A that belongs to C_i. Also, let X_0 = S^k \ (X_1 ∪ · · · ∪ X_{k+1}). The
sets Xi with 1 ≤ i ≤ k + 1 are open, so X0 is closed. Therefore, by Theorem 5.4 some Xi
contains a pair of antipodal points.
If this happens for i ≥ 1, then since H(x) and H(−x) are disjoint, we find that Ci
contains two disjoint sets and we are done. If it happens for i = 0, then the hemispheres
H(x) and H(−x) do not contain any sets of size n, so there must be at least k + 2 points
in the subspace {y : ⟨x, y⟩ = 0}, which contradicts the fact that the points are in general
position.
{r + 1, . . . , n} in the unique order-preserving way. Then σ cannot contain 2413, since in
order to find x2 < x3 such that σ(x2 ) > σ(x3 ) it is necessary to take x2 ∈ B and x3 ∈ A.
But then x1 cannot be in A since we would have σ(x1 ) < σ(x3 ), and x4 cannot be in B for
similar reasons. But that implies that σ(x1 ) > σ(x4 ).
The number of permutations we have just constructed is 2^n, since the set A
determines the permutation uniquely. However, these are not the only ones: they were
enumerated exactly in a paper of Miklós Bóna in 1997. Nevertheless, it turns out that
there are only exponentially many. Since n! is around (n/e)^n, and therefore superexponential,
this shows that avoiding the permutation 2413 is a very strong condition.
The Stanley-Wilf conjecture was that for every permutation π, there exists a constant
C = C(π) such that the number of permutations in Sn that avoid π is at most C n for
every n. It was posed in the late 1980s, received a lot of attention, and was finally solved
by Adam Marcus and Gábor Tardos in 2004 with a remarkably simple argument.
Actually, Marcus and Tardos solved a different problem that was known to imply the
Stanley-Wilf conjecture, so let us first discuss that problem and its solution and then see
how the implication works.
The problem solved by Marcus and Tardos concerns a closely related notion of containment
that concerns 01-matrices. We say that an n × n matrix A contains a k × k matrix
B if we can find x_1 < · · · < x_k and y_1 < · · · < y_k such that A_{x_i y_j} = 1 whenever B_{ij} = 1.
Two important things to note about this definition are that we do not insist that A_{x_i y_j} = 0
if B_{ij} = 0, and that the order matters. The second point is particularly important, so let
us look at it a different way. We can regard A and B as adjacency matrices of bipartite
graphs G(A) and G(B) (so for example G(A) has two vertex sets X, Y that are copies of
[n], with i ∈ X joined to j ∈ Y if and only if Aij = 1). Then the condition that A contains
B is saying more than that G(A) contains G(B) as a subgraph: it is saying that there is an
embedding of G(B) into G(A) in such a way that the orderings of the two sets of vertices
are preserved.
As an example, the matrix
1 0 1 0 0 0
0 1 0 0 0 1
1 0 0 0 0 0
0 0 0 0 1 1
0 0 0 1 1 0
contains the matrix
1 0 1
0 0 1
0 1 0
as is shown by taking rows 2, 4 and 5 and columns 2, 4 and 6.
As with permutations, we say that a matrix A avoids a matrix B if it does not contain
B.
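The containment relation is straightforward (if slow) to implement by brute force, which can help when experimenting with small examples. Here is an added Python sketch, applied to the matrices displayed above:

from itertools import combinations

def contains(A, B):
    # Does 01-matrix A contain 01-matrix B in the ordered sense defined above?
    n, k = len(A), len(B)
    ones = [(i, j) for i in range(k) for j in range(k) if B[i][j] == 1]
    for rows in combinations(range(n), k):
        for cols in combinations(range(len(A[0])), k):
            if all(A[rows[i]][cols[j]] == 1 for (i, j) in ones):
                return True
    return False

A = [[1, 0, 1, 0, 0, 0],
     [0, 1, 0, 0, 0, 1],
     [1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 0]]
B = [[1, 0, 1],
     [0, 0, 1],
     [0, 1, 0]]
assert contains(A, B)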
Of particular interest is the case where B is a permutation matrix. We can ask how
many n × n matrices avoid B, but actually the question Marcus and Tardos looked at was
how many non-zero entries such a matrix can have. It had been conjectured by Füredi and
Hajnal that for any given B the number was at most linear in n. This was the conjecture
that Marcus and Tardos proved.
Theorem 6.1 (Marcus, Tardos). Let P be a permutation matrix. Then there exists a
constant C = C(P) such that every n × n 01-matrix that avoids P has at most Cn non-zero
entries.
Before we move on to the proof, let us briefly look in a slightly different way at what it
means to contain a permutation matrix. Define an ordered partition of [n] to be a partition
into sets X1 , . . . , Xm such that if i < j, then every element of Xi is less than every element
of Xj .
A matrix A contains a permutation matrix P ∈ Sk if and only if there exist ordered
partitions R1 , . . . , Rk and C1 , . . . , Ck of the rows and columns of A such that if Pij = 1 then
there is a non-zero entry in the submatrix of A given by the rows in Ri and the columns
in Cj . To illustrate this with a diagram,
1 0 0
0 0 1
0 1 0
is contained in
1 1 0 0 1
0 0 0 0 1
0 1 0 1 0
0 0 1 0 1
0 0 0 0 1
because each of the blocks corresponding to a 1 of the small matrix contains at least one 1;
here one can take the row blocks {1}, {2, 3}, {4, 5} and the column blocks {1, 2}, {3, 4}, {5}.
(Note that other blockings would have worked just as well.) Thus, we can think of matrix
containment as saying that the small matrix is a "subset of a quotient" of the big matrix.
We shall see that blockings like this play an important part in the proof of Marcus and Tardos.
Let us make this observation more precise. Let A be an n × n 01-matrix, let R1 , . . . , Rk
and C1 , . . . , Ck be as above, and let B be the k × k matrix obtained by setting Bij = 1 if
and only if there is a 1 in the block Ri × Cj – that is, a 1 whose row belongs to Ri and
whose column belongs to Cj . We shall call such a matrix B an ordered quotient of A.
Lemma 6.2. Let P be a permutation matrix and let A be an 01-matrix that avoids P .
Then every ordered quotient of A avoids P .
Proof. Let σ ∈ S_k, let P be the corresponding permutation matrix (that is, P_{ij} = 1 if
j = σ(i) and 0 otherwise), and suppose that A has an ordered quotient B that contains
P. Let R_1, . . . , R_t and C_1, . . . , C_t be the ordered partitions that determine the quotient.
Let x_1 < · · · < x_k and y_1 < · · · < y_k be such that B_{x_i y_{σ(i)}} = 1 for i = 1, 2, . . . , k, and for
each i pick an entry A_{r_i s_i} = 1 such that r_i ∈ R_{x_i} and s_i ∈ C_{y_{σ(i)}}. Since x_1 < · · · < x_k,
y_1 < · · · < y_k and the partitions are ordered, it follows that r_1 < · · · < r_k and
s_{σ^{-1}(1)} < · · · < s_{σ^{-1}(k)}. Therefore, A contains P.
Remark 6.3. If that proof did not seem trivial to you, then I recommend drawing a
diagram, or looking at the diagram of the block matrix given above.
Now let us start the proof in earnest. Let us fix a k × k permutation matrix P and
for each positive integer n let f (n) be the largest possible number of non-zero entries in
an n × n 01-matrix that does not contain P . For convenience, let us assume for the time
being that n is divisible by k 2 . This allows us to take ordered partitions R1 , . . . , Rn/k2 and
C1 , . . . , Cn/k2 of the rows and columns, with each Ri and Cj of size k 2 .
Given an n × n 01-matrix A with k 2 |n, let us write Aij for the submatrix obtained by
restricting to the rows in Ri and columns in Cj . Call these k 2 × k 2 submatrices blocks.
The key to the proof is the following definition. We shall call a block wide if at least k
different columns of the block contain a 1, and tall if at least k different rows contain a 1.
The structure of the proof will be to show that if A does not contain P , then there cannot
be too many wide blocks, or too many tall blocks, or too many non-empty blocks that are
neither wide nor tall.
Lemma 6.4. If A does not contain a given k × k permutation matrix P, then for each
j the number of wide blocks with columns in C_j is at most $(k - 1)\binom{k^2}{k}$, and for each i the
number of tall blocks with rows in R_i is at most $(k - 1)\binom{k^2}{k}$.
Proof. If there are more than $(k - 1)\binom{k^2}{k}$ wide blocks with columns in C_j, then by the
pigeonhole principle we must be able to find a set K of k columns in C_j and a set B of
k blocks such that every block in B contains a 1 in every column in K. But then the
submatrix of A that is formed out of the blocks in B (placed one on top of another) has
a k × k ordered quotient with 1s everywhere: we split the rows according to which block
of B they belong to, and we split the columns in such a way that each of the chosen k
columns belongs to a different cell of the partition. Therefore, A contains the k × k all-1s
matrix, which trivially implies that it contains P .
The same argument, interchanging the roles of rows and columns, proves the statement
for tall blocks.
Lemma 6.4 allows us to prove a recursion for the number of entries that A can have.
Corollary 6.5. Let P be a k × k permutation matrix, let n be divisible by k², and let f(n)
be the largest number of non-zero entries of any n × n 01-matrix that does not contain P.
Then
$$f(n) \le 2k^2(k-1)\binom{k^2}{k}n + (k-1)^2 f(n/k^2).$$
Proof. The number of wide blocks is at most $\frac{n}{k^2}(k-1)\binom{k^2}{k}$, by Lemma 6.4, and each such
block contains at most k⁴ 1s. The same is true of tall blocks. So the number of 1s contained
in wide or tall blocks is at most the first term in the expression above.
Let B be the ordered quotient of A defined by R1 , . . . , Rn/k2 and C1 , . . . , Cn/k2 . Then by
Lemma 6.2, B does not contain P . It follows that B has at most f (n/k 2 ) non-zero entries.
So the number of non-zero blocks of A is at most f (n/k 2 ). We have already counted the
entries that belong to wide or tall blocks. Any other block has at most (k − 1)2 non-zero
entries, so the number of entries not so far counted is at most (k − 1)2 f (n/k 2 ). This proves
the result.
The interesting part of the proof is over: it remains to obtain an upper bound for f (n)
from this recursion.
Proof of Theorem 6.1. If P is a k × k permutation matrix, we shall prove by induction on
n that $f(n) \le 2k^4\binom{k^2}{k}n$.
If n ≤ k², then trivially f(n) ≤ k⁴ ≤ k⁴n, so we are done.
Otherwise, let m be the largest multiple of k² that is less than n. Then, again trivially,
f(n) ≤ f(m) + 2k²n. And by our inductive hypothesis and Corollary 6.5,
$$f(m) \le 2k^2(k-1)\binom{k^2}{k}m + (k-1)^2\cdot 2k^4\,\frac{m}{k^2}\binom{k^2}{k}.$$
But
$$2k^2(k-1) + (k-1)^2\cdot 2k^2 = 2k^3(k-1),$$
so putting this together we find that
$$f(n) \le 2(k^4 - k^3)\binom{k^2}{k}m + 2k^2 n \le 2k^4\binom{k^2}{k}n.$$
Remark 6.6. It is not immediately obvious how to find a matrix with more than (k − 1)2 n
non-zero entries that does not contain every k × k permutation matrix, and for a while it
was conjectured that this should be the right bound. However Jacob Fox, in a very nice
paper, proved that in fact the constant has to be exponential in k for almost all k × k
permutation matrices. So the use of the pigeonhole principle above, which looks quite
inefficient, is in fact not too wasteful.
Corollary 6.7. For every permutation π ∈ S_k there is a constant C such that for every n
there are at most C^n permutations σ ∈ S_n that avoid π.
Proof. Let P be the matrix of π and let α be such that every n × n 01-matrix with at least
αn non-zero entries contains P . For each n let T (n) be the number of n × n 01 matrices
that avoid P. We claim that T(2n) ≤ 15^{αn} T(n).
To see this, observe first that if we take ordered partitions into sets of size 2 of the
rows and columns of a 2n × 2n matrix that avoids P , then the number of distinct quotient
matrices we can obtain is at most T(n), by Lemma 6.2. But each such quotient matrix has
at most αn non-zero entries, so the number of matrices with quotient equal to any given
n × n matrix is at most 15^{αn} (since there are four entries to choose in each non-empty
block and they cannot all be zero). The recursion follows.
It follows immediately by induction that if n is a power of 2, then T(n) ≤ 15^{αn}. It is
also easy to see that the number of permutations in S_n that avoid π is a non-decreasing
function of n, so for general n we may deduce that T(n) ≤ 15^{2αn}.
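For small n the Stanley-Wilf phenomenon can be watched directly. The following added Python sketch counts, by brute force, the permutations of [n] avoiding 2413:

from itertools import combinations, permutations

def contains_pattern(sigma, pi):
    k = len(pi)
    for idx in combinations(range(len(sigma)), k):
        vals = [sigma[i] for i in idx]
        if all((vals[a] < vals[b]) == (pi[a] < pi[b])
               for a in range(k) for b in range(a + 1, k)):
            return True
    return False

pi = (2, 4, 1, 3)                               # the pattern 2413
for n in range(1, 8):
    count = sum(1 for s in permutations(range(1, n + 1))
                if not contains_pattern(s, pi))
    print(n, count)     # grows like C^n, far more slowly than n!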
7 Entropy arguments
This section is slightly different from previous ones in that we shall need to develop a little
bit of theory, and only after that will the arguments be very short. However, the theory
consists of a few basic statements, and what really matters is not the proofs of those
statements, but how to use them. To put it another way, I recommend treating those
statements more like axioms than lemmas. To encourage this, I shall do so myself, but
just to give you something to hold on to, which makes some of the axioms more intuitive,
you should think of the entropy H[X] of a discrete random variable X as a real number
that measures the “information content” of X. Roughly speaking, this is how many bits
of information you gain, on average, if you find out the value of X, or equivalently, the
expected number of bits needed to specify X.
7.1 The Khinchin axioms for entropy and some simple conse-
quences
Entropy has the following properties, which are called the Khinchin (or Shannon-Khinchin)
axioms. (I got some of these from a set of lecture notes by Cosma Shalizi, which I recom-
mend for further discussion of the pros and cons of an axiomatic approach.)
0. Normalization. If X takes the values 0 and 1, each with probability 1/2, then
H[X] = 1.
4. Additivity. For any two random variables X and Y, H[X, Y] = H[X] + H[Y|X],
where
$$H[Y|X] = \sum_x P[X = x]\,H[Y|X = x].$$
Axiom 0 is not really one of Khinchin’s axioms, but the remaining axioms determine H
only up to a multiplicative constant so it is there to fix that constant to a convenient value.
Axioms 1-3 and 5 are rather basic properties of a kind that one might expect, but axiom
4 needs more comment. The quantity H[X, Y ] is simply the entropy (whatever that will
turn out to mean) of the joint random variable (X, Y ). The quantity H[Y |X] is called the
conditional entropy of Y given X: it is the average entropy of Y given the value of X.
Note that from its definition and the fact that H takes non-negative values (which implies
that H[Y |X = x] is non-negative for each x), it follows that H[Y |X] is non-negative.
We now prove a sequence of lemmas, most of them very simple.
Lemma 7.1. If X and Y are independent, then H[Y |X] = H[Y ] and H[X, Y ] = H[X] +
H[Y ].
Proof. For each x the distribution of Y given that X = x is the same as the distribution of
Y , so H[Y |X = x] = H[Y ] for every x, by the invariance axiom. (The reason that axiom
is needed is that strictly speaking the random variable Y |X = x takes values of the form
(x, y), where y is a value taken by Y .) It follows that
$$H[Y|X] = \sum_x P[X = x]H[Y|X = x] = \sum_x P[X = x]H[Y] = H[Y].$$
Lemma 7.2. If X is almost surely constant, then H[X] = 0.
Proof. H[X, X] = H[X], by the invariance axiom. But X and X are independent, since X
is constant, so H[X, X] = 2H[X], by Lemma 7.1. It follows that H[X] = 0.
And another.
where the inequality follows from what we have just proved (together with the invariance
axiom). Since H[X] ≥ 1 (again by what we proved above), it follows that H[X] < H[Y ].
And another.
Lemma 7.4. Let X be a random variable and let Y = f (X) for some function f . Then
H[Y ] ≤ H[X].
Proof. By the invariance axiom, H[X] = H[X, Y], since there is a bijection between values
x taken by X and values (x, f(x)) taken by (X, Y). Therefore, by the additivity axiom,
H[X] = H[Y] + H[X|Y]. Since H[X|Y] ≥ 0, it follows that H[Y] ≤ H[X].
In an earlier version of these notes, I assumed that H was non-negative, having failed
to see a proof of non-negativity from the axioms. However, Sean Eberhard (a postdoc at
Cambridge) pointed out to me the following argument.
Lemma 7.5. H[X] ≥ 0 for every discrete random variable X that takes values in a finite
set A.
Proof. First let us suppose that there exists n such that pa = P[X = a] is a multiple of
n^{-1} for every a ∈ A. Let Y be uniformly distributed on [n], let (Ea : a ∈ A) be a partition
of [n] such that |Ea | = pa n for each a ∈ A, and let Z be the random variable where Z = a
if Y ∈ Ea . Then Z and X are identically distributed, so H[Z] = H[X], by invariance.
Now H[Y, Z] = H[Z] + H[Y|Z], by additivity. Also, H[Y, Z] = H[Y], by Lemma 7.4,
since (Y, Z) depends only on Y. And finally, for each a ∈ A, Y given that Z = a is uniformly
distributed on a set of size at most n, so by Lemma 7.3 it follows that H[Y|Z = a] ≤ H[Y].
This implies that H[Y|Z] ≤ H[Y] = H[Y, Z] = H[Z] + H[Y|Z], and therefore that
H[X] = H[Z] ≥ 0.
In the general case, since we can approximate the probabilities pa arbitrarily closely by
multiples of n^{-1} for a suitably large n, we can apply the continuity axiom to obtain the
same conclusion.
Here is a slightly less simple lemma.
Lemma 7.6. Let X be a random variable that takes at least two values with non-zero
probability. Then H[X] > 0.
Proof. Let A be the set of values taken by X, let α = max_{a∈A} P[X = a], and for each n
denote by X^n the A^n-valued random variable that is given by n independent copies of X.
Then the maximum probability of any value taken by X^n is α^n. Since α < 1, for any ε > 0
there exists n such that α^n < ε. It follows that we can partition A^n into two sets E and
F, each of which has probability between 1/2 − ε and 1/2 + ε. Now let Y be a random
variable that takes the value 0 if X^n ∈ E and 1 if X^n ∈ F. Then H[X^n] = nH[X], by
Lemma 7.1 (and induction), and also H[X^n] = H[Y] + H[X^n|Y] ≥ H[Y]. But H[Y] > 0 for
sufficiently small ε, by the normalization axiom and continuity. It follows that H[X^n] > 0
and therefore that H[X] > 0.
We end this sequence of lemmas with a result that is often useful. It is sometimes
known as the chain rule for entropy.
Lemma 7.7. Let X1 , . . . , Xk be random variables taking values in a set A. Then
H[X1 , . . . , Xk ] = H[X1 ] + H[X2 |X1 ] + H[X3 |X1 , X2 ] + · · · + H[Xk |X1 , . . . , Xk−1 ].
Proof. By additivity,
H[X1 , . . . , Xk ] = H[X1 , . . . , Xk−1 ] + H[Xk |X1 , . . . , Xk−1 ].
The result therefore follows by induction, with the additivity axiom as the base case.
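Once we know (as we shall verify in Section 7.3) that the formula H[X] = −Σ_x P[X = x] log₂ P[X = x] satisfies the axioms, statements like additivity can be checked numerically. Here is an added Python sketch that tests H[X, Y] = H[X] + H[Y|X] on a random joint distribution:

import random
from math import log2

def H(dist):
    # Shannon entropy of a distribution given as {outcome: probability}
    return -sum(p * log2(p) for p in dist.values() if p > 0)

pairs = [(x, y) for x in range(3) for y in range(2)]
w = [random.random() for _ in pairs]
P = {xy: wi / sum(w) for xy, wi in zip(pairs, w)}    # joint law of (X, Y)

marg_X = {x: sum(p for (a, _), p in P.items() if a == x) for x in range(3)}
H_Y_given_X = sum(
    marg_X[x] * H({y: P[(x, y)] / marg_X[x] for y in range(2)})
    for x in range(3)
)

assert abs(H(P) - (H(marg_X) + H_Y_given_X)) < 1e-9  # additivity holds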
7.2 The number of paths of length 3 in a bipartite graph
Let us now consider the following problem. Suppose that G is a bipartite graph with finite vertex sets A and B and density α. (The density is defined to be the number of edges divided by |A||B|.) A labelled P_3 is a quadruple (x_1, y_1, x_2, y_2) such that x_1, x_2 ∈ A, y_1, y_2 ∈ B, and x_1y_1, y_1x_2, and x_2y_2 are all edges of G. In other words, it is a path of length 3 in the graph, but we allow degeneracies such as x_1 = x_2.

How many labelled P_3s must a bipartite graph with density α contain? If G is biregular, meaning that every vertex in A has degree α|B| and every vertex in B has degree α|A|, then there are α^3|A|^2|B|^2, since we can choose x_1 in |A| ways, then y_1 in α|B| ways, then x_2 in α|A| ways, and finally y_2 in α|B| ways. We shall now show that this is the smallest number of labelled P_3s that there can be. The proof will assume that entropy exists – that is, that there is some H that satisfies the Khinchin axioms. Later we shall see that this assumption is valid, and that will put in the final piece of the jigsaw.
Before you read the proof, I strongly recommend you try to prove the result for yourself
by elementary means, since it looks as though it ought to be possible (given that the result
is true), but turns out to be surprisingly tricky, and if you haven’t experienced the difficulty,
then you won’t appreciate the power of the entropy approach.
How does one use entropy to prove results in combinatorics? Part of the answer lies in
axiom 2. Suppose that X is uniformly distributed on a set of size k, and Y is a random
variable with H[Y] ≥ H[X]. Then axiom 2 implies that Y takes at least k different values.
Therefore, if we want to prove that a set A has size at least k, one way of doing it is to
find a random variable that takes values in A and has entropy at least H[X].
At first this may seem a very strange approach, since by axiom 2 we know that if there
is such a random variable, then a random variable that is uniformly distributed on A will
also work. If we define f (n) to be the entropy of a random variable that is uniformly
distributed on a set of size n, then all we seem to be doing is replacing the cardinality of
a set by f of that cardinality, which doesn’t look as though it will achieve anything.
However, there is a flaw in that criticism, which is that it might in principle be easier
to obtain a lower bound for the entropy of a carefully chosen distribution on a set A (given
certain assumptions about A) than it is to find a lower bound on the cardinality of A. And
indeed, this turns out to be the case in many interesting situations, including the one at
hand.
We wish to obtain a lower bound for the number of labelled P_3s in a bipartite graph G of density α, and to do so we shall obtain a lower bound for the entropy of the following distribution on the set of labelled P_3s, which is not uniform (except when the graph is biregular). We choose an edge x_1y_1 (with x_1 ∈ A and y_1 ∈ B) uniformly at random, then a vertex x_2 uniformly from the neighbours of y_1, and then a vertex y_2 uniformly from the neighbours of x_2.
Let X_1, Y_1, X_2, and Y_2 be the random variables corresponding to x_1, y_1, x_2, and y_2, respectively. We now wish to say something about the entropy H[X_1, Y_1, X_2, Y_2]. The chain rule (Lemma 7.7) tells us that it is equal to

H[X_1, Y_1] + H[X_2|X_1, Y_1] + H[Y_2|X_1, Y_1, X_2].

Now

H[X_2|X_1, Y_1] = ∑_{a∈A} ∑_{b∈B} P[X_1 = a, Y_1 = b] H[X_2|X_1 = a, Y_1 = b].

But for each fixed b, conditionally on the event Y_1 = b the random variables X_1 and X_2 are independent: the way we choose x_2 once we have chosen y_1 depends entirely on y_1 and not on how y_1 was obtained. Therefore, the sum simplifies to

∑_{a∈A} ∑_{b∈B} P[X_1 = a, Y_1 = b] H[X_2|Y_1 = b] = ∑_{b∈B} P[Y_1 = b] H[X_2|Y_1 = b] = H[X_2|Y_1].

Similarly, H[Y_2|X_1, Y_1, X_2] = H[Y_2|X_2]. Since H[X_2|Y_1] = H[X_2, Y_1] − H[Y_1] and H[Y_2|X_2] = H[X_2, Y_2] − H[X_2] (by additivity), we obtain

H[X_1, Y_1, X_2, Y_2] = H[X_1, Y_1] + H[X_2, Y_1] + H[X_2, Y_2] − H[Y_1] − H[X_2].

Notice that the first three terms are the entropies of the distributions of the three edges of the random labelled P_3. We now make an important observation.
Lemma 7.8. Given a random labelled P_3 from the distribution defined above, the three edges are all uniformly distributed over all edges.

Proof. The first edge is uniformly distributed by the definition of the distribution. Now the number of edges is α|A||B| and the number of edges x_1y_1 with y_1 = b is d(b) (the degree of b), so the probability that Y_1 = b is d(b)/α|A||B|, and the probability that X_2 = a given that Y_1 = b is 0 if ab is not an edge and d(b)^{−1} if ab is an edge. So the probability that (X_2, Y_1) = (a, b) is 1/α|A||B| whenever ab is an edge, which is another way of saying that X_2Y_1 is uniformly distributed over all edges. And once we know that, then the same proof shows that X_2Y_2 is uniformly distributed.
We also know that H[Y_1] and H[X_2] are at most as big as they would be if Y_1 and X_2 were uniformly distributed. So if we let X be uniformly distributed over A, Y be uniformly distributed over B, and E be uniformly distributed over all edges, then a lower bound for the entropy of (X_1, Y_1, X_2, Y_2) is

3H[E] − H[Y] − H[X].
Consider now the random variable (X1 , Y1 , X2 , Y2 , X, Y ), where (X1 , Y1 , X2 , Y2 ) is as
before, X is a random element of A, and Y is a random element of B, with X and Y
independent of each other and of (X1 , Y1 , X2 , Y2 ). By the above bound and Lemma 7.1
this random variable has entropy at least 3H[E], which is the entropy of the uniform
distribution over all triples of edges (by Lemma 7.1). From this and Lemma 7.3, it follows
that |A||B| times the number of labelled P_3s is at least the cube of the number of edges, which is α^3|A|^3|B|^3, and from this we get that the number of labelled P_3s is at least α^3|A|^2|B|^2, as required.
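As a quick sanity check of the inequality we have just proved (an illustration only, not part of the proof), the following Python sketch counts labelled P_3s in a small random bipartite graph by brute force and compares the count with α^3|A|^2|B|^2.

```python
import random

def labelled_p3s(A, B, E):
    """Count quadruples (x1, y1, x2, y2) with x1y1, x2y1 and x2y2 all edges."""
    E = set(E)
    return sum(1 for x1 in A for y1 in B for x2 in A for y2 in B
               if (x1, y1) in E and (x2, y1) in E and (x2, y2) in E)

random.seed(0)
A, B = range(7), range(5)
E = [(x, y) for x in A for y in B if random.random() < 0.4]
alpha = len(E) / (len(A) * len(B))
print(labelled_p3s(A, B, E), ">=", alpha ** 3 * len(A) ** 2 * len(B) ** 2)
```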
The statement just proved is a special case of the following famous conjecture of
Sidorenko.
Conjecture 7.9. Let G be a bipartite graph with finite vertex sets X and Y and density α.
Let H be another bipartite graph with vertex sets A and B and let φ be a random function
that takes A to X and B to Y . Then the probability that φ(a)φ(b) is an edge of G for every
edge ab of H is at least α^{|E(H)|}.
In rough terms, this can be thought of as saying that if you want to minimize the number
of copies of H in a bipartite graph of density α, then you cannot do better than to pick the
edges of the bipartite graph independently at random with probability α. The conjecture
has been proved for several classes of bipartite graphs, some by entropy methods, but in
general it remains stubbornly open.
7.3 The formula for entropy

Of course, sometimes we use facts such as that √2 > 1, but those can be deduced from the above properties and the ordered-field axioms that the reals satisfy.
With those remarks out of the way, here's the formula. If X is a discrete random variable taking values in a set A, then, writing p_a for P[X = a], we have

H[X] = ∑_{a∈A} p_a log(1/p_a),
where the logarithm is to base 2. People often write this as −∑_{a∈A} p_a log(p_a), a practice I don't like because it has a kind of clever-clever "You thought I was negative but I'm not!" aspect to it.
A quick example to help with orientation: if X is uniformly distributed over a set A of size n, then the formula tells us that H[X] = ∑_{a∈A} n^{−1} log n = log n. In particular, if n = 2^k, then the entropy is k, which reflects the intuitive idea that we need k bits of information to specify an element of A.
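The formula is trivial to compute. A minimal Python sketch, purely for orientation:

```python
from math import log2

def entropy(ps):
    """H[X] = sum over a of p_a * log2(1/p_a), ignoring zero-probability values."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

print(entropy([1 / 8] * 8))            # uniform on 2^3 points: 3.0 bits
print(entropy([1 / 2, 1 / 4, 1 / 4]))  # 1.5 bits
print(entropy([1.0]))                  # a constant random variable: 0.0
```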
It is not hard to prove that this function satisfies the axioms given earlier. Normaliza-
tion, invariance, extensibility and continuity are obvious. Maximality is a simple conse-
quence of Jensen’s inequality: the function log x is concave, so if A is any finite set, then
for any random variable X taking values in A, if we write p_a for P[X = a], then the p_a are non-negative and sum to 1, so we have

∑_{a∈A} p_a log(1/p_a) ≤ log(∑_{a∈A} p_a · (1/p_a)) = log |A|,

which is the entropy of the uniform distribution on A.
The only axiom that takes a little work is additivity. Let X take values in A and Y take values in B, and write p_{ab} for P[X = a, Y = b]. Since p_{ab} = p_a P[Y = b|X = a], we have

H[X, Y] = ∑_{a∈A}∑_{b∈B} p_{ab} log(1/p_{ab}) = ∑_{a∈A}∑_{b∈B} p_{ab} log(1/p_a) + ∑_{a∈A}∑_{b∈B} p_{ab} log(1/P[Y = b|X = a]).

But

∑_{a∈A}∑_{b∈B} p_{ab} log(1/p_a) = ∑_{a∈A} p_a log(1/p_a) = H[X],
and

∑_{a∈A}∑_{b∈B} p_{ab} log(1/P[Y = b|X = a]) = ∑_{a∈A} p_a ∑_{b∈B} P[Y = b|X = a] log(1/P[Y = b|X = a])
= ∑_{a∈A} p_a H[Y|X = a]
= H[Y|X].
Thus, H[X, Y ] = H[X] + H[Y |X].
This proves that there is a function that satisfies the entropy axioms, and that completes the proof of the lower bound for the number of labelled P_3s.
Lemma 7.10. Let X be uniformly distributed on a set of size 2^k. Then H[X] = k.

Proof. Let Y be uniformly distributed on a set of size 2. Then H[Y] = 1 by the normalization axiom, which implies that H[Y^k] = k by Lemma 7.1 and induction. Since Y^k is uniformly distributed on a set of size 2^k, H[X] = k as well, by invariance.
Lemma 7.11. Let X be uniformly distributed on a set of size n. Then H[X] = log n.

Proof. For each r, X^r is uniformly distributed on a set of size n^r, and H[X^r] = rH[X]. Therefore, if 2^k ≤ n^r ≤ 2^{k+1}, then k ≤ rH[X] ≤ k + 1, by Lemmas 7.3 and 7.10. It follows that k/r ≤ H[X] ≤ (k+1)/r whenever k/r ≤ log n ≤ (k+1)/r, so H[X] and log n differ by at most 1/r for every r. This implies that H[X] = log n, as claimed.
Corollary 7.12. Let X take values in a finite set A, with p_a = P[X = a]. Then H[X] = ∑_a p_a log(1/p_a).
Proof. First let us assume that there exists n such that p_a is a multiple of n^{−1} for every a ∈ A. As in the proof of Lemma 7.5, let Y be uniformly distributed on [n], let [n] be partitioned into sets E_a, one for each a ∈ A, with |E_a| = p_a n, and let us assume that X = a if and only if Y ∈ E_a (which by the invariance axiom loses no generality).
Then by additivity (and invariance), H[Y] = H[X] + H[Y|X]. But by Lemma 7.11 H[Y] = log n, and

H[Y|X] = ∑_a p_a H[Y|X = a] = ∑_a p_a H[Y|Y ∈ E_a] = ∑_a p_a log(p_a n),

where for the last equality we again applied Lemma 7.11. It follows that

H[X] = log n − ∑_a p_a (log p_a + log n) = ∑_a p_a log(1/p_a).
As in the proof of Lemma 7.5, we obtain the general case by applying the continuity axiom.
7.4 Brégman’s theorem
We shall now prove another result using entropy. For this result the final answer is a
number, so the proof is not quite as purely based on the entropy axioms. However, the
only numerical result needed is Lemma 7.11, the relatively simple result that the entropy
of the uniform distribution on a set of size n is log n.
We begin with a definition of an important combinatorial concept. The permanent of an n × n matrix A, denoted per(A), is defined by

per(A) = ∑_{σ∈S_n} ∏_{i=1}^n A_{iσ(i)};

that is, it is given by the same expansion as the determinant, but without the signs. If A is a 01-matrix, then per(A) is the number of perfect matchings in the bipartite graph whose bipartite adjacency matrix is A. Brégman's theorem is the following upper bound for the permanent of a 01-matrix.

Theorem 7.14. Let A be an n × n 01-matrix and for each i let r_i be the number of 1s in row i. Then

per(A) ≤ ∏_{i=1}^n (r_i!)^{1/r_i}.
Note a simple special case: if A consists entirely of 1s, then the right-hand side gives the nth power of (n!)^{1/n}, which equals n!, which is the number of perfect matchings in the complete bipartite graph K_{n,n}. So the bound is sharp in this case. Note also that if A is a matrix that splits into blocks as

A = ( A_1   0  )
    (  0   A_2 ),

then per(A) = per(A_1) per(A_2), and if we write f(A) for the upper bound in Brégman's theorem, then we also have that f(A) = f(A_1)f(A_2). So this gives us a reasonably wide class of matrices for which the bound is sharp.
We now give a proof, due to Jaikumar Radhakrishnan, that uses entropy. The theorem
has a number of proofs, some quite short, so we do not claim that the use of entropy is
essential: nevertheless it provides an interesting illustration of the technique.
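Before the proof, the reader may like to check the bound computationally. The following Python sketch (the matrix is an arbitrary small example, and the function names are ad hoc) computes the permanent by brute force and compares it with Brégman's bound.

```python
from itertools import permutations
from math import factorial, prod

def permanent(A):
    """Permanent of a square matrix by brute force (fine for small n)."""
    n = len(A)
    return sum(prod(A[i][sigma[i]] for i in range(n))
               for sigma in permutations(range(n)))

def bregman(A):
    """Brégman's upper bound: the product over rows of (r_i!)^(1/r_i)."""
    return prod(factorial(sum(row)) ** (1 / sum(row)) for row in A)

A = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 1, 1]]
print(permanent(A), "<=", bregman(A))
```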
Entropy proof of the theorem. Let G be the bipartite graph with bipartite adjacency ma-
trix A: that is, the vertex sets of G are two copies X, Y of [n] and xy is an edge of G if and only if A_{xy} = 1. Let σ be a perfect matching chosen uniformly at random from all perfect
matchings in G. We shall obtain an upper bound for the entropy H[σ]. (To be clear, σ
is a random variable with values in Sn and H[σ] is the entropy of that random variable.)
Brégman's theorem is equivalent to the assertion that the number of perfect matchings is at most ∏_{x∈X} (d(x)!)^{1/d(x)}, where d(x) is the degree of x, which in turn is equivalent to the inequality

H[σ] ≤ ∑_{x∈X} (1/d(x)) log(d(x)!),

since σ is uniformly distributed, so H[σ] is the logarithm of the number of perfect matchings (by Lemma 7.11). Now let x_1, . . . , x_n be any ordering of the vertices in X. By the chain rule (Lemma 7.7),

H[σ] = H[σ(x_1)] + H[σ(x_2)|σ(x_1)] + · · · + H[σ(x_n)|σ(x_1), . . . , σ(x_{n−1})].
We now bound the quantities on the right-hand side, starting with H[σ(x_1)]. We know that σ(x_1) must be one of the d(x_1) neighbours of x_1 in G. The probability that any given neighbour is chosen depends on the matrix A in a horribly complicated way, but since σ(x_1) can take only d(x_1) different values, we can at least bound H[σ(x_1)] above by the entropy of the uniform distribution on a set of size d(x_1), which by Lemma 7.11 is log(d(x_1)).
Now let us turn to H[σ(x_2)|σ(x_1)]. If we are given σ(x_1), then we know two things about σ(x_2): that it is a neighbour of x_2, and that it is not equal to σ(x_1). For each x and σ, let us define d_1^σ(x) to be the number of neighbours of x not equal to σ(x_1). Then an upper bound for H[σ(x_2)|σ(x_1)] is E_σ log(d_1^σ(x_2)).
More generally, for each k, if we are given σ(x_1), . . . , σ(x_{k−1}), then for each x, let d_{k−1}^σ(x) be the number of neighbours of x not equal to any of σ(x_1), . . . , σ(x_{k−1}). Then an upper bound for H[σ(x_k)|σ(x_1), . . . , σ(x_{k−1})] is E_σ log(d_{k−1}^σ(x_k)), so

H[σ] ≤ ∑_{k=1}^n E_σ log(d_{k−1}^σ(x_k)).
At this point we seem to be facing a serious difficulty, which is that it is hard to say anything about the distribution of d_{k−1}^σ, and hence hard to bound the expectations we need to bound. However, we still have a card up our sleeve, mentioned above, which is that we can average over all orderings x_1, . . . , x_n of [n].
Let σ be fixed, and let x_1, . . . , x_n be a random ordering. Let x be a vertex in X and let us think about the contribution x makes to the sum

∑_{k=1}^n log(d_{k−1}^σ(x_k)).
Let the neighbours of x be y_1, . . . , y_d, where d = d(x). Then the ordering of the vertices σ^{−1}(y_1), . . . , σ^{−1}(y_d), which all belong to X, is distributed uniformly amongst all d! possible orderings, and the position of x (which is σ^{−1}(y_j) for some j) in this ordering is therefore uniformly distributed amongst the d possibilities. If x comes in position s in the ordering, and k (which depends on the ordering) is such that x_k = x, then d_{k−1}^σ(x_k) = d − s + 1. Thus, for any fixed σ, d_{k−1}^σ(x_k) is uniformly distributed in the set {1, 2, . . . , d(x)}.
It follows that, for every fixed σ,

E ∑_{k=1}^n log(d_{k−1}^σ(x_k)) = ∑_{x∈X} (1/d(x)) ∑_{s=1}^{d(x)} log(d(x) − s + 1) = ∑_{x∈X} (1/d(x)) log(d(x)!),

where the expectation is over the random ordering x_1, . . . , x_n. Averaging our earlier bound for H[σ] over the ordering and exchanging the order of the two expectations, we conclude that H[σ] ≤ ∑_{x∈X} (1/d(x)) log(d(x)!), which is exactly the inequality we wanted.

7.5 Shearer's entropy lemma
Lemma 7.15. Let X and Y be discrete random variables. Then H[X, Y ] ≤ H[X] + H[Y ].
Proof using the formula. Since H[X, Y ] = H[Y ] + H[X|Y ], by the additivity axiom, this
is equivalent to the assertion that H[X|Y ] ≤ H[X], which makes intuitive sense, since
knowing the value of Y shouldn’t increase the amount of information we obtain from X.
To prove it, let A be the set of values taken by X, let B be the set of values taken by Y ,
and write p_a for P[X = a], q_b for P[Y = b], and p_{a,b} for P[X = a, Y = b]. Then

H[X|Y] = ∑_b q_b H[X|Y = b]
       = ∑_b q_b ∑_a (p_{a,b}/q_b) log(q_b/p_{a,b})
       = ∑_a p_a ∑_b (p_{a,b}/p_a) log(q_b/p_{a,b}).

Now ∑_b p_{a,b}/p_a = 1 for each a, so by Jensen's inequality this last expression is at most

∑_a p_a log(∑_b (p_{a,b}/p_a)(q_b/p_{a,b})) = ∑_a p_a log(1/p_a) = H[X].
Axiomatic proof. First consider the case where Y is uniformly distributed. Let notation be as in the previous proof. Then

H[Y|X] = ∑_{a∈A} p_a H[Y|X = a] ≤ ∑_{a∈A} p_a H[Y] = H[Y],

by the maximality axiom, since for each a the random variable Y conditioned on X = a takes values in the set on which Y is uniformly distributed. By additivity, this implies that H[X, Y] ≤ H[X] + H[Y]. The general case now follows by continuity.
Although we shall not use it in this course, it is worth knowing a definition that arises
naturally from the above fact. The mutual information I[X; Y ] is defined to be H[X] +
H[Y ] − H[X, Y ]. Lemma 7.15 implies that the mutual information is non-negative. Other
simple observations are that if X = Y (or more generally X determines Y and vice versa)
then I[X; Y ] = H[X], that if Y = f (X), then I[X; Y ] = H[Y ], and that if X and Y are
independent, then I[X; Y ] = 0. One can think of the mutual information as being the
amount of information that is "shared" by X and Y. The formula

H[X, Y] = H[X] + H[Y] − I[X; Y],

which follows from the definition, is highly reminiscent of the formula

|A ∪ B| = |A| + |B| − |A ∩ B|

for the size of the union of two finite sets.
Theorem 7.16 (Shearer's lemma). Let S be a random subset of [n], chosen according to some probability distribution, and suppose that for every i, P[i ∈ S] ≥ µ. Then for any discrete random variables X_1, . . . , X_n,

µ H[X_1, . . . , X_n] ≤ E_S H[X_S],

where if S = {i_1, . . . , i_k}, then X_S is short for the random variable (X_{i_1}, . . . , X_{i_k}).
Note that if S is a random singleton, chosen uniformly, then µ can be taken to be n^{−1}, and we conclude that n^{−1} H[X_1, . . . , X_n] ≤ E_i H[X_i], which is just a restatement of the subadditivity just mentioned.
Proof of Theorem 7.16. As in the statement of the theorem, let us write S = {i_1, . . . , i_k}, and let this be done in such a way that i_1 < · · · < i_k. (Note that S is a random set, so the elements i_j are not fixed elements of [n].)
By the chain rule for entropy (Lemma 7.7),

H[X_S] = H[X_{i_1}] + H[X_{i_2}|X_{i_1}] + · · · + H[X_{i_k}|X_{i_1}, . . . , X_{i_{k−1}}].

Write H[X_i|X_{<i}] for H[X_i|X_1, . . . , X_{i−1}]. Since conditioning on more variables reduces entropy, the above equality implies that

H[X_S] ≥ ∑_{i∈S} H[X_i|X_{<i}].
Taking expectations, we deduce that

E_S H[X_S] ≥ E_S ∑_{i∈S} H[X_i|X_{<i}]
          = ∑_{i=1}^n P[i ∈ S] H[X_i|X_{<i}]
          ≥ µ ∑_{i=1}^n H[X_i|X_{<i}]
          = µ H[X_1, . . . , X_n],
where the second inequality is by the main hypothesis of the theorem and the last equality
is the chain rule again.
We now give a representative application of Shearer's lemma. It gives a bound for the following question: if a graph has m edges, how many triangles can it have? It is natural to guess that the best construction will be a complete graph on n vertices, at least if we can find n such that (n choose 2) = m. In that case we have (n choose 2) edges and (n choose 3) triangles. These are roughly n^2/2 and n^3/6, which would suggest that the number of triangles is at most something along the lines of (2m)^{3/2}/6. This is in fact exactly the bound we shall obtain. But although the result is not unexpected, it is notable that for the proof we do not have to get our hands in the slightest bit dirty: there are no arguments, for instance, that "compress" the graph.
Theorem 7.17. Let G be a graph with m edges and t triangles. Then t ≤ (2m)^{3/2}/6.
Proof. Let T be a triangle chosen uniformly at random from G and let X = (X_1, X_2, X_3) be a uniformly random ordering of its vertices. Then X is uniformly distributed on a set of size 6t, so H[X] = log(6t).
We now let S be a random pair of elements of {1, 2, 3} (with each pair chosen with probability 1/3). Then we can set µ = 2/3 in Shearer's lemma, from which we deduce that

(2/3) H[X] ≤ E_S H[X_S].

It follows (by averaging again) that there is some pair S such that (2/3)H[X] ≤ H[X_S]. But the values taken by X_S are ordered pairs of vertices that form end-points of edges, so X_S is supported on a set of size 2m. It follows that H[X_S] ≤ log(2m). We conclude that

(2/3) log(6t) ≤ log(2m),
which is equivalent to the assertion of the theorem.
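A quick numerical illustration of the theorem, in Python (the random graph is arbitrary; this is not part of the proof):

```python
import itertools, random

random.seed(1)
n = 12
E = {frozenset(e) for e in itertools.combinations(range(n), 2) if random.random() < 0.5}
m = len(E)
t = sum(1 for tri in itertools.combinations(range(n), 3)
        if all(frozenset(e) in E for e in itertools.combinations(tri, 2)))
print(t, "<=", (2 * m) ** 1.5 / 6)
```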
Now for a somewhat more subtle use of Shearer’s lemma. We shall use it to obtain a
bound on the following natural problem. Let G be a family of graphs with vertex set [n].
We say that G is ∆-intersecting if for any two graphs G1 , G2 ∈ G, the intersection G1 ∩ G2
contains a triangle. And now we ask the obvious question: how large can a ∆-intersecting
family be?
As with intersecting families, there is also an obvious construction: just take all graphs that contain some fixed triangle. This gives us a family of size 2^{(n choose 2)}/8. In the other direction, if we just use the fact that any two graphs in a ∆-intersecting family have a non-empty intersection, we get an upper bound of 2^{(n choose 2)}/2. (You might ask whether we can do better by using the fact that the intersection of any two graphs in the family has size at least 3, but the set of graphs with at least (1/2)(n choose 2) + 2 edges has that property and its size is not that much smaller than 2^{(n choose 2)}/2.)
It turns out that the bound 2^{(n choose 2)}/8 is the best possible – this is a relatively recent result of David Ellis, Yuval Filmus and Ehud Friedgut – but for a long time the best known bound was 2^{(n choose 2)}/4, which we shall prove now, using Shearer's lemma. This result is due to Chung, Frankl, Graham, and Shearer himself.

Theorem 7.18. A ∆-intersecting family G of graphs with vertex set [n] has size at most 2^{(n choose 2)}/4.
Proof. As usual, we let X be an element of G chosen uniformly at random, and our aim will be to obtain an upper bound for H[X], which will translate directly into an upper bound for |G|. In the argument that follows, it may help to think of each element of G as a function from [n]^{(2)} to {0, 1} – the characteristic function of the graph – so we are thinking of X as a random variable (X_e)_{e∈[n]^{(2)}}, where X_e = 1 if e belongs to the graph and 0 otherwise.
To apply Shearer's lemma, we shall choose S as follows. We first pick a set R ⊂ [n] uniformly at random, and we let G_R be the graph that consists of a clique with vertex set R and a clique with vertex set [n] \ R. Note that every edge e has a probability 1/2 of being an edge of G_R. Therefore, if we define X_R to be the intersection X ∩ G_R, which corresponds to the random variable (X_e)_{e∈E(G_R)}, then Shearer's lemma gives us that

(1/2) H[X] ≤ E_R H[X_R].

Now we use the fact that G is ∆-intersecting. Any two graphs in G intersect in at least a triangle, and every triangle intersects every graph G_R, since two of its three vertices lie on the same side of the partition (R, [n] \ R), so at least one of its edges belongs to G_R. It follows that the restriction of G to any G_R is an intersecting family, and therefore has size at most 2^{e(G_R)−1}, where e(G_R) is the number of edges of G_R. Therefore, H[X_R] ≤ e(G_R) − 1 for every R. This gives us that

H[X] ≤ 2 E_R H[X_R] ≤ 2 E_R (e(G_R) − 1).

Since each element of [n]^{(2)} has a probability 1/2 of belonging to G_R, the right-hand side is equal to (n choose 2) − 2. Since X is uniform, H[X] = log|G|, and this proves the result.
It’s worth reflecting a little on what Shearer’s lemma did for us in the above argument.
It was a little bit similar in spirit to the averaging argument that proved the Erdős-Ko-Rado
theorem, in that we showed that a suitably chosen “random part” of our ∆-intersecting
family was not too large, and then we exploited that to get a bound on the size of the whole
family. However, unlike in Katona’s argument, the “random part” was not the intersection
of the family with a random set (such as the set of intervals in a cyclic order) but rather the
projection of the family on to a random set of coordinates, where each coordinate had an
equal probability of belonging to the set. So Shearer’s lemma tells us that if we can find a
nicely balanced set of projections of a set system and can give good upper bounds for their
sizes, then we can get a good upper bound for the size of the whole family. This general
theme of bounding the size of a set in terms of the size of its projections is a significant
one. (See for example the Loomis-Whitney inequality and its generalizations.)
polynomial of degree d to vanish at m given points, a certain set of m linear combinations
of the coefficients must all be zero. By dimension considerations, this is possible if m is
less than the number of coefficients. The result follows.
When d = p − 1, this gives us that for every set A of size less than (n+p−1 choose p−1) = (n+p−1 choose n) we can find a non-zero polynomial of degree d that vanishes on A. Since (n+p−1 choose n) > p^n/n!, this is in particular true when |A| ≤ p^n/n!.
The next lemma is the main idea behind the proof of Dvir’s theorem.
Lemma 8.3. Suppose that A ⊂ F_p^n contains a line in every direction, that d < p, and that there exists a non-zero polynomial f of degree at most d that vanishes on A. Then there is a non-zero degree-d polynomial that vanishes everywhere on F_p^n.
Proof. Without loss of generality the degree of f is exactly d. Let a, z ∈ F_p^n with z ≠ 0 and let L be the line consisting of all points a + tz. The restriction of f to L is a polynomial of degree d in t, and its leading coefficient is f_d(z), where f_d is the degree-d part of f. To see this, observe that for any monomial ∏_{i=1}^n x_i^{r_i} its value at a + tz is ∏_{i=1}^n (a_i + tz_i)^{r_i}, so if the monomial has degree d, then to obtain a term in t^d we must choose tz_i from each bracket, which gives t^d ∏_{i=1}^n z_i^{r_i}.
Now f vanishes everywhere on some line L in direction z, since A contains such a line and f vanishes on A. Since the dependence of f on t along L is given by a polynomial of degree less than p that vanishes at all p points of F_p, all the coefficients of that polynomial must be zero (a non-zero polynomial of degree less than p has fewer than p roots). It follows that f_d(z) = 0. But z was an arbitrary non-zero element of F_p^n, and f_d vanishes at zero as well, so it vanishes everywhere.
To finish the proof, we need the second of our simple but powerful lemmas. Note that
there is an important distinction between the zero polynomial, which is the polynomial with all coefficients of all monomials equal to zero, and a polynomial that takes the value zero everywhere on F_p^n. For instance, the polynomial x^p − x is not the zero polynomial but takes the value zero everywhere on F_p.
Lemma 8.4. Let f be a non-zero polynomial on F_p^n of degree less than p. Then f is not identically zero.

Proof. We use induction on n. If n = 1, then f is a non-zero polynomial of degree less than p in one variable, so it has fewer than p roots and therefore does not vanish everywhere on F_p. If n > 1, write f = ∑_j g_j(x_1, . . . , x_{n−1}) x_n^j and pick j such that g_j is not the zero polynomial. Since g_j also has degree less than p, by induction there is a point (a_1, . . . , a_{n−1}) at which g_j does not vanish. But then f(a_1, . . . , a_{n−1}, x_n) is a non-zero polynomial of degree less than p in the single variable x_n, so there exists a_n with f(a_1, . . . , a_{n−1}, a_n) ≠ 0.
Note that this result is sharp, as can be seen by considering the polynomial p(x) = x_1^p − x_1.

Proof of Theorem 8.1. Combine Lemmas 8.2, 8.3 and 8.4 and the result follows straight away, with c(n) = 1/n!.
Since we are most of the way to a proof, we briefly discuss a result called the Schwartz-
Zippel lemma, which tells us that a polynomial of degree d in n variables cannot have
too many roots. Some sort of intuition for how many roots we might expect comes from
thinking about linear polynomials: there we do not get more than p^{n−1} roots. A very useful
and general principle in applications of arithmetic geometry is that polynomials behave in
a rather similar way to linear functions. For example, a linear function from R to R has
at most one root, while a polynomial of degree d has at most d roots, which is the same
up to a constant.
Lemma 8.5. Let f be a non-zero polynomial of degree at most d on F_p^n. Then f has at most dp^{n−1} roots.

Proof. Without loss of generality the degree is exactly d. As we have already seen, if we restrict f to the line consisting of points a + tz, then we obtain a polynomial of degree d in the single variable t with leading coefficient f_d(z), where f_d is the degree-d part of f. By Lemma 8.4 we may choose z such that f_d(z) ≠ 0. But that means that on every line L in direction z the restriction of f to L is given by a polynomial in one variable of degree exactly d. So f has at most d roots on each such line, and since the p^{n−1} lines in direction z partition F_p^n, f has at most dp^{n−1} roots in total.
The bound here is sharp as well, as can be seen by considering polynomials that depend on x_1 only. Note also that the Schwartz-Zippel lemma implies Lemma 8.4.
One use of the Schwartz-Zippel lemma appears in theoretical computer science in connection with the general problem of polynomial identity testing. Here one is given two polynomials P and Q in n variables over a finite field F_p, and the task is to decide whether they are in fact the same polynomial. If the polynomials are presented as sums of products of other polynomials, then it may be impractical to determine this by expanding everything out as a sum of monomials, of which there will be roughly n^d if the polynomials have degree d, and in general the problem appears to be very hard.
However, the Schwartz-Zippel lemma provides a probabilistic algorithm. One selects r points x = (x_1, . . . , x_n) independently and uniformly at random and checks for each of them whether P(x_1, . . . , x_n) = Q(x_1, . . . , x_n). If ever they differ, we know that the polynomials are not identical. Conversely, if P and Q have degree d and are different, then by the Schwartz-Zippel lemma the probability that we will not detect this with one of our random choices of x_1, . . . , x_n is at most (d/p)^r, since for each choice we have to pick a root of P − Q and the probability of choosing a root is at most d/p. So by making r large (but not ridiculously large) we can achieve an extremely high degree of confidence (falling short of total certainty) that either the two polynomials are the same or we will detect that they are different.
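To make this concrete, here is a minimal Python sketch of the randomized test. It is an illustration under simplifying assumptions: the polynomials are toy examples given as callables, and probably_equal is an ad hoc name rather than any standard library routine.

```python
import random

def probably_equal(P, Q, n, p, d, trials=50):
    """Schwartz-Zippel identity test: P, Q are callables on n-tuples over F_p,
    and d bounds deg(P - Q). A wrong 'True' has probability at most (d/p)**trials."""
    for _ in range(trials):
        x = tuple(random.randrange(p) for _ in range(n))
        if P(x) % p != Q(x) % p:
            return False  # found a witness: the polynomials definitely differ
    return True

p = 101
f = lambda v: (v[0] + v[1]) ** 2
g = lambda v: v[0] ** 2 + 2 * v[0] * v[1] + v[1] ** 2  # the same polynomial as f
h = lambda v: v[0] ** 2 + v[1] ** 2                    # differs from f
print(probably_equal(f, g, 2, p, 2), probably_equal(f, h, 2, p, 2))
```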
8.2 Alon’s combinatorial Nullstellensatz
The next result, due to Noga Alon, concerns polynomials, but has a number of applications
to problems that do not appear to have anything to do with polynomials.
Lemma 8.6. Let f be a non-zero polynomial in n variables over F_p of degree k_1 + · · · + k_n, where the k_i are non-negative integers and the coefficient of x_1^{k_1} · · · x_n^{k_n} is non-zero. Let S_1, . . . , S_n be subsets of F_p such that |S_i| > k_i for each i. Then f does not vanish on S_1 × · · · × S_n.
Proof. We prove this by induction on the degree of f . The result is easy when the degree
is zero.
If the degree is greater than zero, then without loss of generality k_1 > 0. Let a ∈ S_1 and use polynomial division to write f(x) in the form (x_1 − a)P(x) + Q(x), where Q does not depend on x_1. Since the term in x_1^{k_1} · · · x_n^{k_n} has a non-zero coefficient in f, and the degree of P is k_1 + · · · + k_n − 1, the term in x_1^{k_1−1} x_2^{k_2} · · · x_n^{k_n} has non-zero coefficient in P.
Suppose that the conclusion is false. Then f vanishes on {a} × S_2 × · · · × S_n, from which it follows that Q vanishes on this set too. Since Q does not depend on x_1, it vanishes on all of S_1 × · · · × S_n. Therefore, (x_1 − a)P vanishes on S_1 × · · · × S_n, which implies that P vanishes on (S_1 \ {a}) × S_2 × · · · × S_n. But |S_1 \ {a}| > k_1 − 1, so this contradicts the inductive hypothesis applied to P, and we are done.
As our first application of the combinatorial Nullstellensatz, we give a short proof of
the Cauchy-Davenport theorem, which is the following result (discovered independently by
Cauchy in 1813 and Davenport in 1935 – apparently it was not until 1947 that Davenport
found out that Cauchy had beaten him to it by over a century). Neither Cauchy nor
Davenport used the method below.
Theorem 8.7. Let p be a prime and let A and B be subsets of F_p. Then |A + B| ≥ min{p, |A| + |B| − 1}.
Proof. Once we have the clue that this can be proved using the combinatorial Nullstellen-
satz, we need to find a suitable sequence of sets S1 , . . . , Sn and a polynomial that vanishes
on S1 × · · · × Sn and that has small degree unless A + B is large.
This gives enough of a clue to complete the proof. The obvious sequence of sets to take,
since the only sets we have around are A and B, is (A, B). We want the degree of our
polynomial to depend on the size of A + B, and we also want it to vanish on A × B. The
most economical way of getting it to vanish at a point (a, b) is to ensure that the polynomial
has a factor x + y − (a + b), which leads to the idea of considering the polynomial

f(x, y) = ∏_{c∈A+B} (x + y − c).
This polynomial vanishes on A × B and has degree |A + B|. Suppose now that the theorem is false, so that |A + B| ≤ |A| + |B| − 2 and |A + B| ≤ p − 1. To apply the combinatorial Nullstellensatz with S_1 = A and S_2 = B, we need to find r and s such that the coefficient of x^r y^s in f is non-zero, r < |A|, s < |B| and r + s = |A + B|. But if we pick any r and s that satisfy the last three conditions, which we clearly can if |A + B| ≤ |A| + |B| − 2, then the coefficient of x^r y^s in the polynomial is (r+s choose r), and this is non-zero because p is prime and r + s < p. Since f vanishes on A × B, this contradicts Lemma 8.6.
If |A| + |B| > p, then for any x the sets A and x − B intersect, so |A + B| = p.
Note that this result is sharp if A and B are arithmetic progressions in F_p with the same common difference. Note too that the result is false in Z_n if n is composite, since then Z_n has proper subgroups (take A = B to be a proper subgroup of size at least 2).
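The theorem is also easy to verify exhaustively for small primes. A short Python sketch (an illustration, not part of the proof), checking the bound for all pairs of sets of sizes 3 and 4 in F_11:

```python
from itertools import combinations

def sumset(A, B, p):
    return {(a + b) % p for a in A for b in B}

p = 11
print(all(len(sumset(A, B, p)) >= min(p, len(A) + len(B) - 1)
          for A in combinations(range(p), 3)
          for B in combinations(range(p), 4)))
```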
Now we shall use the combinatorial Nullstellensatz to prove a variant of the Cauchy-
Davenport theorem. The result is due to da Silva and Hamidoune, who found a somewhat
involved combinatorial proof. This short argument was discovered by Alon, Nathanson
and Ruzsa. Let us write A+̇B for the set of sums a + b such that a ∈ A, b ∈ B and a ≠ b.

Theorem 8.8. Let p be a prime and let A and B be subsets of Z_p. Then |A+̇B| ≥ min{p, |A| + |B| − 3}.
Proof. First we show that if p > 2 and |A| + |B| ≥ p + 2, then A+̇B = Z_p. (If p = 2 the theorem itself is trivial.) To see this, let x ∈ Z_p. Then A ∩ (x − B) contains at least two elements, so at least one of them, a say, does not satisfy the equation 2a = x. But then a ∈ A and b = x − a ∈ B are distinct elements that add up to x.
We would now like to apply the combinatorial Nullstellensatz, so we need to show that
if A+̇B is too small, then some polynomial of low degree vanishes everywhere on a product
set. The natural product set to try to take is A × B. What low-degree polynomial would
vanish on A × B? Well, we know that A+̇B is small, so a first approximation would be to take the polynomial ∏_{c∈A+̇B} (x + y − c). The trouble with that is that it does not necessarily
vanish when x = y ∈ A ∩ B. But we can take care of that by simply multiplying by the
polynomial x − y.
So now we have a polynomial of degree |A+̇B| + 1 that vanishes on A × B. For technical reasons it will be convenient to modify it slightly. Let us assume that the result is false and let C be a set that contains A+̇B and has cardinality exactly |A| + |B| − 4. Let P be the polynomial (x − y) ∏_{c∈C} (x + y − c). This polynomial vanishes on A × B and has degree |A| + |B| − 3.
Let us look at the terms in x^{|A|−1} y^{|B|−2} and x^{|A|−2} y^{|B|−1}, since if either of these has a non-zero coefficient then we will have contradicted the combinatorial Nullstellensatz. Let us write r = |A|, s = |B|, t = r + s − 3. Then the coefficient of x^{r−1} y^{s−2} is (t−1 choose r−2) − (t−1 choose r−1) and the coefficient of x^{r−2} y^{s−1} is (t−1 choose r−3) − (t−1 choose r−2).
Now it is not possible for three consecutive binomial coefficients (a choose b−1), (a choose b), (a choose b+1) with a < p to be congruent mod p. This can be checked by hand if p = 2. Otherwise, since (a choose b) = (a choose b−1)(a−b+1)/b and (a choose b+1) = (a choose b)(a−b)/(b+1), and since no binomial coefficient (a choose c) with a < p is divisible by p, we would require b − 1 ≡ a − b ≡ b + 1 mod p, which is impossible when p > 2. It follows that the two coefficients above are not both zero, so the result is proved.
Note that if A = B = {0, 1, 2, . . . , r − 1}, then A+̇B = {1, 2, . . . , 2r − 3}, so the result is sharp when the sets have equal size. It is not sharp for general sizes, since for example if B is a singleton, then |A+̇B| ≥ |A| − 1 = |A| + |B| − 2.
The above two results can be proved by other means, but there are results that have
been proved using the combinatorial Nullstellensatz for which no other proof is known –
just as no proof of Dvir’s theorem is known that does not go via polynomials.
It is interesting to note that there is an important difference between Dvir’s proof and
proofs that use the combinatorial Nullstellensatz. The former uses the fact that a non-
zero polynomial of low degree cannot have too many roots. That is, the zero set of a
low-degree non-zero polynomial cannot be too large. The combinatorial Nullstellensatz
places a restriction (under a slightly different hypothesis) on not just the size but also the
structure of the zero set: it cannot be a Cartesian product of sets that are too large.
Theorem 8.9. For every δ > 0 there exists n such that every subset A ⊂ F_3^n of density at least δ contains three distinct vectors x, y, z with x + y + z = 0.

Theorem 8.10. For every δ > 0 there exists n such that every subset A ⊂ {1, 2, . . . , n} of density at least δ contains an arithmetic progression of length 3.
It is natural to ask for more quantitative versions of these results. Meshulam's proof tells us that there is an absolute constant C such that the conclusion of the theorem holds if n ≥ Cδ^{−1}. Turning that round, this says that if A ⊂ F_3^n is a set of density at least Cn^{−1}, then the conclusion holds. However, the densest known sets that did not contain triples x, y, z with x + y + z = 0 were far sparser than this – they had exponentially small density (as a function of n).
Bridging this huge gap became a notorious open problem, known as the cap-set problem,
until a remarkable development in 2016. First Ernie Croot, Seva Lev, and Péter Pál Pach
proved that a subset of Z_4^n with no progression of length 3 has to be of exponentially small density, and then within a couple of weeks Jordan Ellenberg and Dion Gijswijt
independently showed that the Croot-Lev-Pach technique could be made to solve the cap-
set problem itself. Not long after that, Terence Tao came up with a related argument
that, unlike the arguments of Ellenberg and Gijswijt, treated the three variables x, y, z
symmetrically. We give Tao’s argument here.
The argument begins with a notion of rank for 3-tensors (that is, 3-dimensional generalizations of matrices). First, note that if X, Y are finite sets, F is a field, and f : X × Y → F is a function of two variables, then the rank of f, if we regard it as a matrix, is the minimum k such that we can find functions u_1, . . . , u_k : X → F and v_1, . . . , v_k : Y → F such that

f(x, y) = ∑_{i=1}^k u_i(x) v_i(y)

for every (x, y) ∈ X × Y.
There are two natural ways of generalizing this notion to functions of three variables. One is to take the minimum k such that f can be written as a sum of k products of one-variable functions, that is, in the form

f(x, y, z) = ∑_{i=1}^k u_i(x) v_i(y) w_i(z).

Another, slightly less obvious but very natural when one stops to think about it, is to write it in the form

f(x, y, z) = ∑_{i=1}^k u_i(x, y) v_i(y, z) w_i(x, z).
Both the above definitions have their uses, but Tao introduced an “intermediate” defini-
tion, where instead of using products of one-variable functions or products of two-variable
functions, one takes products of a one-variable function and a two-variable function.
Definition 8.11. Let X, Y and Z be finite sets, let F be a field, and let f : X × Y × Z → F. The slice rank of f is the smallest r such that there exist integers 0 ≤ r_1 ≤ r_2 ≤ r and functions u_1, . . . , u_r and v_1, . . . , v_r such that

f(x, y, z) = ∑_{i=1}^{r_1} u_i(x) v_i(y, z) + ∑_{i=r_1+1}^{r_2} u_i(y) v_i(x, z) + ∑_{i=r_2+1}^{r} u_i(z) v_i(x, y)

for every (x, y, z) ∈ X × Y × Z.
Lemma 8.12. Let X be a finite set, let A ⊂ X, and let f : X × X × X → F be a function such that f(x, y, z) = 0 unless x = y = z ∈ A, and such that f(a, a, a) ≠ 0 for every a ∈ A. Then the slice rank of f is exactly |A|.

Proof. Let δ_a : X → F be defined by δ_a(x) = 1 if x = a and 0 otherwise. Then by hypothesis we have that

f(x, y, z) = ∑_{a∈A} f(a, a, a) δ_a(x) δ_a(y) δ_a(z)

for every (x, y, z) ∈ X × X × X, which implies that f has slice rank at most |A| (since a product of two one-variable functions is a two-variable function).
Now let us suppose that we have a decomposition

f(x, y, z) = ∑_{i=1}^{r_1} u_i(x) v_i(y, z) + ∑_{i=r_1+1}^{r_2} u_i(y) v_i(x, z) + ∑_{i=r_2+1}^{r} u_i(z) v_i(x, y).
Without loss of generality r_1 > 0, so that the first sum above is not empty.
We now claim that there is a function h : X → F such that ∑_{x∈X} h(x)u_i(x) = 0 for i = 1, 2, . . . , r_1 and such that h is non-zero outside a set of size at most r_1. To see this, form an r_1 × |X| matrix M with the functions u_i as its rows (that is, M(i, x) = u_i(x)). Then we are trying to find a solution to the equation Mh = 0 such that h does not have many zeros. We can put M into reduced row-echelon form, and then it has s ≤ r_1 non-zero rows, and a set of s columns such that those rows and columns form a copy of the s × s identity matrix. Let S ⊂ X be the subset corresponding to that set of columns. Then we may choose arbitrary non-zero values h(x) for every x ∉ S, and then the values of h inside S are uniquely determined by the equations. The claim follows.
Now consider the function g : Y × Z → F given by the formula

g(y, z) = ∑_x h(x) f(x, y, z).

Substituting the decomposition above, the first sum contributes ∑_{i=1}^{r_1} (∑_x h(x)u_i(x)) v_i(y, z) = 0, by our choice of h, so

g(y, z) = ∑_{i=r_1+1}^{r_2} u_i(y) (∑_x h(x)v_i(x, z)) + ∑_{i=r_2+1}^{r} (∑_x h(x)v_i(x, y)) u_i(z),

which is a sum of r − r_1 products of two one-variable functions. This proves that g has rank at most r − r_1. On the other hand, g(y, z) = ∑_{a∈A} h(a) f(a, a, a) δ_a(y) δ_a(z), so, regarded as a matrix, g is diagonal with at least |A| − r_1 non-zero diagonal entries, and therefore has rank at least |A| − r_1.
It follows that |A| ≤ r, so the slice rank is at least |A|.
The connection between slice rank and the cap-set problem comes with the following key observation. Suppose that A is a subset of F_3^n that does not contain distinct elements x, y, z with x + y + z = 0. Then if x, y, z belong to A and are not all the same, we must have that x + y + z ≠ 0 and hence that there exists i such that x_i + y_i + z_i ≠ 0. In F_3 this last assertion can be written in polynomial form: x_i + y_i + z_i ≠ 0 if and only if

1 − (x_i + y_i + z_i)^2 = 0.

It follows that the polynomial P(x, y, z) = ∏_{i=1}^n (1 − (x_i + y_i + z_i)^2), regarded as a function on A × A × A, vanishes except when x = y = z, while P(a, a, a) = 1 for every a ∈ A (since 3a = 0 in F_3^n). By Lemma 8.12, the slice rank of this function is therefore exactly |A|.
Lemma 8.13. The slice rank of the polynomial P(x, y, z) = ∏_{i=1}^n (1 − (x_i + y_i + z_i)^2) is at most 3M, where M is the number of 012-sequences of length n that sum to at most 2n/3.

Combining Lemma 8.13 with the observation above, we deduce that |A| ≤ 3M for every set A ⊂ F_3^n containing no three distinct vectors that sum to zero, and standard estimates show that M is exponentially smaller than 3^n.
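Although we do not give the proof of the lemma here, it is easy to get a feel for how small M is computationally. A short Python sketch (illustrative only):

```python
from itertools import product

def M(n):
    """Number of 012-sequences of length n summing to at most 2n/3."""
    return sum(1 for s in product((0, 1, 2), repeat=n) if sum(s) <= 2 * n / 3)

for n in (6, 9, 12):
    print(n, 3 * M(n), 3 ** n)  # the slice-rank bound 3M against 3^n
```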
9 Huang’s solution to the sensitivity conjecture
Just over a year ago (at the time of writing in late 2020), a young mathematician called
Hao Huang astonished theoretical computer scientists by not merely solving a famous
open problem, known as the sensitivity conjecture, but by doing so with an extraordinarily
short proof. As Scott Aaronson put it on his blog, it is a good illustration of the difference
between the complexity classes P and NP, because the proof is extremely short and easy
to understand, but it took a lot of very good mathematicians a few decades to find it.
I shall not say what the sensitivity conjecture is here, because it had been shown to
follow from a very simply stated problem about the discrete cube, so it is the latter problem
and its solution that I shall discuss. Write Q_n for the Hamming cube, that is, the graph with vertex set {0,1}^n where we join x to y if they differ in exactly one coordinate. The theorem that Huang proved is the following.
Theorem 9.1. Let G be an induced subgraph of Q_n with 2^{n−1} + 1 vertices. Then the maximum degree of G is at least √n.

Note that Q_n is bipartite, since if two vertices are joined, then the numbers of 1s in the corresponding sequences have different parities. So the theorem says that if we go from 2^{n−1} vertices to 2^{n−1} + 1 vertices, then the smallest possible maximum degree jumps from zero to √n.
To begin the proof, we define a sequence of symmetric matrices with orthogonal rows, the inductive construction of which is reminiscent of that of the Walsh matrices we saw earlier in the course. We start with A_0 = (0), and then we define

A_n = ( A_{n−1}    I       )
      (   I      −A_{n−1} ).

Here, I is the 2^{n−1} × 2^{n−1} identity matrix.
It is easy to see (by induction) that A_n is symmetric. We also have that its rows are orthogonal. To see this, note that it is trivial by induction if the two rows are either both in the top half or both in the bottom half. Otherwise, if we take the ith row from the top half and the jth row from the bottom half, we obtain the inner product (A_{n−1})_{ij} − (A_{n−1})_{ji}, which is zero because A_{n−1} is symmetric.
Another feature of the matrix A_n is that its entries are all −1, 0 or 1 and that there are n non-zero entries in each row (and hence also in each column). In fact, we can
say more. If we index the rows and columns not with elements of the set {1, 2, . . . , 2^n} but instead with 01-sequences of length n, where the first half of the indices for A_n are obtained from those of A_{n−1} by adding a 0 on the end, and the second half are obtained by adding a 1 on the end, then (A_n)_{xy} is non-zero only if x and y differ in exactly one coordinate. Indeed, if x and y share their last coordinate, then this is true by induction, and if they do not share it, then (A_n)_{xy} is an entry of one of the two identity-matrix parts of A_n, which implies that x and y are equal in all the other coordinates.
Thus, A_n is obtained from the adjacency matrix of Q_n by attaching signs to the edges. We can describe these signs inductively by saying that when x_n = 0 we use the signs from A_{n−1}, when x_n = 1 we multiply all those signs by −1, and for the edges that go across from x_n = 0 to x_n = 1 we have 1s everywhere. For a non-inductive description (which we will not need), if x and y agree except at the ith coordinate, then A_{xy} = (−1)^{x_{i+1}+···+x_n}.
An immediate consequence of these observations is that A_n^2 = nI. (The orthogonality implies that A_n A_n^T = nI and the symmetry implies that A_n = A_n^T.) From this it follows that every eigenvalue of A_n is either √n or −√n. Since tr A_n = 0, the two eigenvalues each occur with multiplicity 2^{n−1}.
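These facts are easy to confirm numerically. Here is a short Python/NumPy sketch (illustrative only) that builds A_n by the recursion above and checks that A_5^2 = 5I and that the eigenvalues are ±√5.

```python
import numpy as np

def A(n):
    """The signed adjacency matrix A_n of the n-dimensional cube."""
    M = np.zeros((1, 1))
    for _ in range(n):
        I = np.eye(len(M))
        M = np.block([[M, I], [I, -M]])
    return M

M = A(5)
print(np.allclose(M @ M, 5 * np.eye(2 ** 5)))           # A_n^2 = nI
print(sorted(set(np.round(np.linalg.eigvalsh(M), 6))))  # eigenvalues -sqrt(5), sqrt(5)
```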
We are now ready to prove the theorem.
Proof of Theorem 9.1. Let V be a subset of {0,1}^n of size 2^{n−1} + 1. Let us define a linear map α : R^V → R^V by taking a function f : V → R to the function g : V → R defined by

g(v) = ∑_{w∈V} (A_n)_{vw} f(w).

Since A_n^2 = nI and tr A_n = 0, the eigenspace of A_n with eigenvalue √n has dimension 2^{n−1}, while the space of functions on {0,1}^n that vanish outside V has dimension 2^{n−1} + 1, so the two spaces have a non-zero intersection. That is, there is a non-zero function f, supported on V, with A_n f = √n f; note that for such an f we have αf = √n f. Now let v ∈ V be a point where |f(v)| is maximal. Then

√n |f(v)| = |∑_{w∈V} (A_n)_{vw} f(w)| ≤ #{w ∈ V : vw is an edge of Q_n} · |f(v)|,

since every entry of A_n is 0, 1 or −1 and (A_n)_{vw} ≠ 0 only when vw is an edge of Q_n. Since f(v) ≠ 0, it follows that v has at least √n neighbours in V, which is exactly what we wanted to prove.

It is natural to wonder whether Huang found the proof in roughly this way, by playing with signed adjacency matrices until the dimension count worked out.
Actually, I know for a fact that that is not quite Huang’s thought process, since he used
a result called Cauchy’s interlace theorem, though the basic ideas were the same. The even
simpler proof presented above was spotted by Shalev Ben-David and shared in a comment
on Scott Aaronson’s blog post.
10 Dimension arguments
The topic of this last section wraps up the course quite nicely, as it will require us to
touch on a number of earlier topics that we have covered, including averaging arguments,
set systems with restrictions on intersections, combinatorial geometry, the Borsuk-Ulam
theorem, and use of polynomials.
The basic idea behind a dimension argument is that one can sometimes obtain an
upper bound for the size of a combinatorial object by constructing an injection from that
object to a low-dimensional vector space such that the image is a linearly independent set.
Then the dimension of the vector space gives the desired upper bound. This technique is
particularly useful for extremal problems where the maximum is far from unique, since it
transforms those problems into other problems (of the form, “How large an independent set
can we find in this vector space?”) that have the same property. We shall use dimension
arguments to prove a number of results about set systems, and then we shall discuss a
famous solution by Jeff Kahn and Gil Kalai to a problem in combinatorial geometry.
10.1 Subsets of R^n that give rise to at most two distances

Let a and b be distinct positive real numbers, and let {a_1, . . . , a_m} be a subset of R^n such that the distance d(a_i, a_j) between any two distinct points is equal to either a or b. How large can m be? To obtain an upper bound, define a polynomial in two vector variables by f(x, y) = (a^2 − d(x, y)^2)(b^2 − d(x, y)^2).
For each i, let f_i be the n-variable polynomial given by f_i(x) = f(x, a_i). Then these polynomials are linearly independent, since f_i(a_i) = a^2 b^2 ≠ 0, and f_i(a_j) = 0 for every j ≠ i.
To squeeze as much information as we can out of this observation, we would now like to prove that the polynomials f_i all lie in a low-dimensional subspace of the space of n-variable polynomials. Since they are all quartics, we can obtain an upper bound of order n^4 with ease, but with slightly more thought we can do substantially better. Note that

d(x, a_i)^2 = ∑_j x_j^2 − 2 ∑_j x_j a_{ij} + ∑_j a_{ij}^2,

where we write a_{ij} for the jth coordinate of a_i. It follows that f_i(x) is a linear combination of (∑_j x_j^2)^2, the n polynomials x_h ∑_j x_j^2, the n(n+1)/2 polynomials x_h x_j with h ≤ j, the n polynomials x_j, and 1. So the f_i live in a space of dimension 1 + n + n(n+1)/2 + n + 1 = (n+1)(n+4)/2, and therefore m ≤ (n+1)(n+4)/2.
The above bound is not the best known: using a strengthening of the same method,
Blokhuis obtained an upper bound of (n + 1)(n + 2)/2. But it is still an open problem to
determine the exact bound.
10.2 Set systems with intersections of restricted parity

Theorem 10.2. Let A be a family of subsets of [n] of even size such that any two distinct sets in the family have an intersection of odd size. Then |A| ≤ n if n is odd, and |A| ≤ n − 1 if n is even.
Proof. We begin by proving the same statement but with “even” and “odd” reversed: that
is, we assume that the sets have odd size and that their intersections have even size.
Associate with each set A ∈ A its characteristic vector χ_A and regard that vector as an element of F_2^n. We can define an F_2-valued inner product in the obvious way, namely ⟨f, g⟩ = ∑_x f(x)g(x), where the sum is in F_2. Note that with respect to this inner product, we have that ⟨χ_A, χ_A⟩ = 1 and ⟨χ_A, χ_B⟩ = 0 for any two distinct A, B ∈ A. In other words, the characteristic vectors form an "orthonormal basis" over F_2.
The usual proof that orthogonal vectors are independent works here too: if ∑_A λ_A χ_A = 0, then for every B, 0 = ⟨∑_A λ_A χ_A, χ_B⟩ = λ_B. Therefore, |A| ≤ n.
Note that the sets {1}, {2}, . . . , {n} show that this bound is best possible.
Now let us assume that the sets have even size and odd intersections.
If n is odd, we can replace each A by [n] \ A, which yields a family of sets of odd size
with even intersections. (This last assertion follows easily from the fact that if A and B are
sets with cardinalities of the same parity, then |A \ B| and |B \ A| have the same parity.)
Thus |A| ≤ n.
The sets {1, 2}, . . . , {1, n}, {2, 3, . . . , n} show that this bound is best possible.
If n is even, then replace each A ∈ A that contains n by [n] \ A and leave the remaining
sets as they are. If A and B are any two sets and |A| is even, then |A ∩ B| and |A ∩ B^c| have the same parity, and since n is even, |B| and |B^c| have the same parity as well. Also,
A cannot contain a complementary pair since all intersections have odd size. Therefore,
we obtain a new set system with the same property and the same number of sets, but now
with all sets contained in [n − 1]. Since n − 1 is odd, we find that |A| ≤ n − 1.
The sets {1, 2}, . . . , {1, n} show that this bound is best possible.
Now let us consider what happens if the sets and their intersections have even sizes.
Theorem 10.3. Let A be a collection of subsets of [n] and suppose that |A| and |A ∩ B| are even for every A, B ∈ A. Then |A| ≤ 2^{⌊n/2⌋}.
Proof. Bounding |A| is equivalent to bounding the size of a subset F ⊂ F_2^n with the property that ⟨f, g⟩ = 0 for every f, g ∈ F.
Let k be maximal such that F contains a linearly independent set of size k. Then F lies in the orthogonal complement of that set, which has size 2^{n−k}, so |F| ≤ 2^{n−k}. But F also lies in the linear span of that set, so |F| ≤ 2^k. It follows that |F| ≤ max_k min{2^k, 2^{n−k}} = 2^{⌊n/2⌋}.
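The bound of Theorem 10.3 is attained when n is even: partition [n] into pairs and take all unions of pairs. A short Python sketch of this construction (illustrative only):

```python
from itertools import chain, combinations

n = 8
blocks = [(0, 1), (2, 3), (4, 5), (6, 7)]  # partition the ground set into pairs
family = [set(chain.from_iterable(c))
          for r in range(len(blocks) + 1)
          for c in combinations(blocks, r)]
assert all(len(A) % 2 == 0 for A in family)                      # even sizes
assert all(len(A & B) % 2 == 0 for A in family for B in family)  # even intersections
print(len(family), "=", 2 ** (n // 2))
```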
10.3 More general restricted intersections

Theorem 10.4. Let p be a prime and let A be a collection of subsets of [n] such that |A| is a multiple of p for every A ∈ A and |A ∩ B| is not a multiple of p for any two distinct A, B ∈ A. Then |A| ≤ (n choose 0) + (n choose 1) + · · · + (n choose p−1).

Proof. Given a set A ∈ A, define a polynomial P_A over F_p in n variables by

P_A(x) = 1 − (∑_{i∈A} x_i)^{p−1}.

Note that P_A(x) = 1 if ∑_{i∈A} x_i ≡ 0 mod p and P_A(x) = 0 otherwise (by Fermat's little theorem). Note also that P_A has degree p − 1.
If x is the characteristic function of a set B ∈ A, then ∑_{i∈A} x_i = |A ∩ B|, so P_A(x) = 1 if A = B and 0 otherwise, since in the second case our hypothesis implies that |A ∩ B| is not a multiple of p. This proves that the polynomials P_A are linearly independent.
However, this is not enough to give us the bound stated. For that, we need a further
trick, which is to observe that if P is any polynomial in n variables and we replace each occurrence of x_i^r by x_i (so for example the monomial x_1^3 x_4^6 x_5^{11} would become the monomial x_1 x_4 x_5), then we obtain a new polynomial that takes the same value as P on any x that has all its coordinates equal to 0 or 1, since for such an x we have that x_i^r = x_i for every r ≥ 1.
So now let us replace each P_A by a polynomial Q_A that agrees with P_A on all characteristic functions of sets and has degree at most 1 in each variable x_i. Then Q_A(χ_B) = P_A(χ_B) = δ_{AB} for every A, B ∈ A, so the Q_A are also independent.
The number of monomials of degree at most p − 1 and of degree at most 1 in each variable x_i is

(n choose 0) + (n choose 1) + · · · + (n choose p−1),

so the Q_A live in a space of this dimension, which proves the result.
Note that in the case p = 2 this gives a bound of n + 1, which is pretty close to the
exact bound we obtained in the previous subsection.
An interesting special case of the above result is where n = 4p and the sets all have size 2p. The result then tells us that if A is a collection of subsets of [4p] of size 2p, no two of which are disjoint and no two of which have an intersection of size exactly p, then |A| ≤ ∑_{m<p} (4p choose m), which is exponentially smaller than 2^{4p}.
From this, we can obtain the following geometrical consequence.
Corollary 10.5. Let p be an odd prime and let n = 4p. Then the largest measure of a set X of unit vectors in R^{4p} that contains no pair of orthogonal vectors is exponentially small.

Proof. Let Y be the collection of unit vectors all of whose coordinates are either 1/(2√p) or −1/(2√p) and that sum to zero. If A ⊂ [4p] is a set of size 2p, let v_A be the vector in Y that takes the value 1/(2√p) on A and −1/(2√p) on [4p] \ A. Observe that ⟨v_A, v_B⟩ = 0 if and only if |A ∩ B| = p. It therefore follows from Theorem 10.4 that |X ∩ Y| ≤ ∑_{m=0}^{p−1} (4p choose m), which is exponentially smaller than 2^n since p = n/4.
Since the property of not containing a pair of orthogonal vectors is invariant under rotation, it also follows that if ρ is any rotation of R^n, then the intersection ρX ∩ Y also has size at most ∑_{m=0}^{p−1} (n choose m).
Now we do an averaging argument very similar to the averaging step in Katona's proof of the Erdős-Ko-Rado theorem. Let ρ be a random rotation of R^n. Then for each y ∈ Y, the probability that y ∈ ρX is µ(X) (where µ is the rotation-invariant probability measure on the sphere), so the expected size of ρX ∩ Y is µ(X)|Y| = µ(X)(n choose 2p). But it is also at most ∑_{m=0}^{p−1} (n choose m), from which we deduce that µ(X) ≤ (n choose 2p)^{−1} ∑_{m=0}^{p−1} (n choose m). The estimates in Section 2 imply easily that this is exponentially small as a function of n.
To justify the last sentence of the above proof slightly more, (n choose 2p) = (n choose n/2), which is n^{−1/2} 2^n up to a constant factor, and we also know that if α < 1/2, then the sum ∑_{m≤αn} (n choose m) is exponentially smaller than 2^n, by Lemma 2.6.
10.4 Kahn and Kalai's disproof of Borsuk's conjecture

Borsuk asked whether every bounded subset of R^d of diameter 1 can be partitioned into at most d + 1 sets of smaller diameter. Kahn and Kalai disproved this in a strong way, using the special case of Theorem 10.4 that we have just discussed. Let p, n and Y be as above.

Now let X ⊂ R^{n²} be the set of all points v ⊗ v such that v ∈ Y. (We saw this notation in the section on the cap-set problem: v ⊗ v denotes the matrix with ijth entry v_i v_j, which we can think of as a point in R^{n²}.) For any v, w ∈ R^n,

⟨v ⊗ v, w ⊗ w⟩ = ∑_{ij} v_i v_j w_i w_j = ⟨v, w⟩^2 ≥ 0,
with equality if and only if v and w are orthogonal. This calculation also shows that if v and w are unit vectors, then so are v ⊗ v and w ⊗ w, which implies that

‖v ⊗ v − w ⊗ w‖^2 = 2 − 2⟨v ⊗ v, w ⊗ w⟩ = 2 − 2⟨v, w⟩^2 ≤ 2,

with equality if and only if ⟨v, w⟩ = 0. It follows that X has diameter √2, and that no subset of X of smaller diameter contains a pair of vectors v_A ⊗ v_A and v_B ⊗ v_B such that v_A and v_B are orthogonal, or equivalently such that |A ∩ B| = p. The result now follows from Theorem 10.4.
It follows immediately that X cannot be covered by fewer than (n choose 2p) / ∑_{m=0}^{p−1} (n choose m) subsets of smaller diameter. This is exponentially large in n, and so exponentially large in the square root of the dimension of the space that contains X.