
Topics in combinatorics

W.T. Gowers

This course will cover a miscellaneous collection of topics in combinatorics and closely
related fields. What the topics have in common is that they all involve proofs that at
one time surprised experts by their simplicity. Sometimes these were the first proofs of
long-standing open problems, and sometimes they were new proofs of results that had
previously been established by much longer arguments. Several of these arguments use
ideas and techniques that have gone on to be used by many other people.
Another theme of the course is the sheer diversity of methods that are used in combi-
natorics. We shall see uses of probability, linear algebra, linear analysis, topology, entropy,
multivariate polynomials, tensor rank, concentration of measure, and more. (There will
also be one or two arguments that are completely elementary.)

Contents

1 Averaging arguments
1.1 Sperner's theorem
1.2 The Erdős-Ko-Rado theorem
1.3 The Szemerédi-Trotter theorem

2 A few bounds on binomial coefficients

3 Well-separated sets and vectors
3.1 Associating vectors with sets
3.2 Hadamard matrices

4 The sum-product problem

5 The chromatic number of the Kneser graph
5.1 The Borsuk-Ulam theorem and some variants
5.2 What has the Borsuk-Ulam theorem got to do with combinatorics?
5.3 Proof of Kneser's conjecture

6 Marcus and Tardos's solution to the Stanley-Wilf conjecture

7 Entropy arguments
7.1 The Khinchin axioms for entropy and some simple consequences
7.2 The number of paths of length 3 in a bipartite graph
7.3 The formula for entropy
7.3.1 The axioms uniquely determine the formula
7.4 Brégman's theorem
7.5 Shearer's entropy lemma

8 The polynomial method
8.1 Dvir's solution to the Kakeya problem for finite fields
8.2 Alon's combinatorial Nullstellensatz
8.3 The solution to the cap-set problem

9 Huang's solution of the sensitivity conjecture

10 Dimension arguments
10.1 Subsets of R^n that give rise to at most two distances
10.2 Set systems with intersections of restricted parity
10.3 More general restricted intersections
10.4 Kahn and Kalai's disproof of Borsuk's conjecture

1 Averaging arguments
Let a1 , . . . , an be real numbers with average a. Then there exists i such that ai ≥ a and
there exists i such that ai ≤ a. More generally, we have the following extremely basic fact.

Proposition 1.1. If X is a random variable, then P[X ≥ EX] > 0 and P[X ≤ EX] > 0.

Proof. It might seem as though this fact is too obvious even to need a proof. That is
almost true, and for the purposes of this course there is no problem if you just assume it,
but for completeness here is a proof.
If P[X ≥ EX] = 0, then for each n let $E_n$ be the event that
\[ X \in (EX - (n-1)^{-1},\, EX - n^{-1}], \]
where we interpret $EX - 0^{-1}$ to be −∞. Then $\sum_n P[E_n] = 1$. Therefore,
\[ EX \le \sum_n (EX - n^{-1}) P[E_n] = EX - \sum_n n^{-1} P[E_n], \]
which is strictly less than EX, since there must exist n with $P[E_n] > 0$. This is a contradiction. The second inequality follows on applying the first to −X.

Another way of stating the proposition is to say that EX ∈ [min X, max X]. Bizarrely,
this fact, which is particularly basic if the random variable X is discrete, the case that will
normally concern us, can quite often be used to prove highly non-trivial theorems. This
section will discuss three main examples: Sperner’s theorem, the Erdős-Ko-Rado theorem,
and the Szemerédi-Trotter theorem.
Before I get on to those, let me mention a closely related very basic principle, which
underlies so-called double-counting arguments. It can be formalized in various ways, but
here’s a simple one.

Proposition 1.2. Let G be a bipartite graph with finite vertex sets X and Y . Suppose
that the average degree of a vertex in X is δ|Y |. Then the average degree of a vertex in Y
is δ|X|.

Proof. The number of edges of G is δ|X||Y | and the result follows instantly.
This is “double counting” because we are counting the number of edges in two different
ways: as |X| times the average degree of a vertex in X, and as |Y | times the average degree
of a vertex in Y .
The following proposition, which has essentially the same proof, perhaps captures better
how double-counting arguments often work.

Proposition 1.3. Let G be a bipartite graph with finite vertex sets X and Y . Suppose that
the average degree of a vertex in X is at least d1 and the average degree of a vertex in Y
is at most d2 . Then |Y | ≥ d1 |X|/d2 .

Proof. The number of edges of G is at least d1 |X| and at most d2 |Y |. The result follows.
Here are two simple examples of double-counting arguments.

Example 1.4. Let G be a planar graph with V vertices, E edges and F faces. Then each
face contains at least three edges and each edge is contained in at most two faces. It follows
from Proposition 1.3 that E ≥ 3F/2. (Let X be the set of faces of G and Y be the set of
edges of G, and join a face to an edge if and only if the face contains the edge. Then apply
the proposition.)

Before I give the next example, let me introduce some standard notation that will be
useful throughout the course. I shall write [n] for {1, 2, . . . , n}, and $[n]^{(r)}$ for the set of
subsets of [n] of size r.

Example 1.5. Let 0 ≤ r < s ≤ n, let $\mathcal{A}$ be a subset of $[n]^{(r)}$, and let $\partial_s\mathcal{A}$ be the s-upper shadow of $\mathcal{A}$, which is defined to be $\{B \in [n]^{(s)} : \exists A \in \mathcal{A} \text{ s.t. } A \subset B\}$. Then every set in $\mathcal{A}$ is contained in $\binom{n-r}{n-s}$ sets in $\partial_s\mathcal{A}$, and every set in $\partial_s\mathcal{A}$ contains at most $\binom{s}{r}$ sets in $\mathcal{A}$. It follows that
\[ |\partial_s\mathcal{A}| \ge \binom{n-r}{n-s}|\mathcal{A}| \Big/ \binom{s}{r}, \]
which can also be written as $\frac{(n-r)!\,r!}{(n-s)!\,s!}|\mathcal{A}|$, and therefore as $\binom{n}{s}|\mathcal{A}| \big/ \binom{n}{r}$.

1.1 Sperner’s theorem
Let $\mathcal{A}$ be a family of subsets of [n]. How large can $\mathcal{A}$ be if no member of $\mathcal{A}$ is contained in any other? A simple way to construct such a family is to ensure that all its members have the same size. Sperner showed that this is the best one can do.

Theorem 1.6 (Sperner). Let $\mathcal{A}$ be a family of subsets of [n] such that no member of $\mathcal{A}$ is contained in any other. Then $|\mathcal{A}| \le \binom{n}{\lfloor n/2 \rfloor}$.

Proof. For 0 ≤ i ≤ n let $\mathcal{A}_i$ be the set $\mathcal{A} \cap [n]^{(i)}$ – that is, the collection of sets in $\mathcal{A}$ of size i. Let $x_1, \dots, x_n$ be a random ordering of [n] and for 0 ≤ i ≤ n let $E_i$ be the set $\{x_1, \dots, x_i\}$. Then the number of i such that $E_i \in \mathcal{A}$ is at most 1, since no member of $\mathcal{A}$ is contained in any other. But for each i, $E_i$ is uniformly distributed in $[n]^{(i)}$, so the expected number of i such that $E_i \in \mathcal{A}$ is $\sum_{i=0}^n |\mathcal{A}_i| / \binom{n}{i}$. (Here we used linearity of expectation, another very simple principle that surprisingly often has non-trivial consequences.) It follows that $\sum_{i=0}^n |\mathcal{A}_i| / \binom{n}{i} \le 1$, and therefore, since the largest binomial coefficient is $\binom{n}{\lfloor n/2 \rfloor}$, that $|\mathcal{A}| = \sum_i |\mathcal{A}_i| \le \binom{n}{\lfloor n/2 \rfloor}$, as claimed.
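The inequality $\sum_{i=0}^n |\mathcal{A}_i| / \binom{n}{i} \le 1$ extracted in the proof is known as the LYM inequality, and for tiny n it can be checked by brute force. Here is a short Python sanity check (a sketch, not part of the mathematics above) that enumerates every family of subsets of [3], verifies the LYM inequality for each antichain, and confirms that the maximum antichain size is $\binom{3}{1} = 3$:

    from itertools import combinations
    from math import comb

    n = 3
    subsets = [frozenset(c) for r in range(n + 1)
               for c in combinations(range(1, n + 1), r)]

    best = 0
    for mask in range(1 << len(subsets)):   # all 2^(2^n) families of subsets of [n]
        family = [s for i, s in enumerate(subsets) if mask >> i & 1]
        if all(not a < b for a in family for b in family):   # antichain check
            # the LYM inequality from the proof above
            assert sum(1 / comb(n, len(a)) for a in family) <= 1 + 1e-9
            best = max(best, len(family))

    print(best, comb(n, n // 2))   # prints "3 3", matching Sperner's theorem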
Note that the last step in the proof above threw away a lot of interesting information:
the above proof shows that, as one might expect, including lots of sets that are either very
small or very big makes it harder to avoid containments. The proof itself is not the original
proof of Sperner.
Another remark is that the proof used the contrapositive of Proposition 1.1.

1.2 The Erdős-Ko-Rado theorem


Let $\mathcal{A}$ be a family of subsets of [n] of size k. If any two members of $\mathcal{A}$ intersect, then how large can $\mathcal{A}$ be? If k > n/2, then the answer is trivially $\binom{n}{k}$. If k = n/2, then the answer is not trivial but it is still easy. If A ⊂ [n] is a set of size k, then we cannot pick both A and $A^c$, and it is simple to see that if from every pair $(A, A^c)$ we pick exactly one set, then any two members of the resulting family of sets intersect. So when k = n/2 the answer is $\frac{1}{2}\binom{n}{k}$. When k < n/2, the problem is no longer easy, though the answer is simple and natural: the largest intersecting family is obtained by taking all sets that contain some given element.
The proof we shall give of the next theorem is not the one given by Erdős, Ko and Rado, but rather a later one discovered by Gyula Katona. Like the proof of Sperner's theorem above, it makes use of a random ordering.

Theorem 1.7 (Erdős, Ko, Rado). Let k < n/2 and let $\mathcal{A} \subset [n]^{(k)}$ be an intersecting family. Then $|\mathcal{A}| \le \binom{n-1}{k-1}$, with equality if and only if $\mathcal{A}$ is of the form $\{A \in [n]^{(k)} : i \in A\}$.
Proof. Let $x_1, \dots, x_n$ be a random cyclic ordering of [n]. (That is, we think of $x_1$ as the successor of $x_n$.) For each j, let $I_j$ be the "interval" $\{x_j, x_{j+1}, \dots, x_{j+k-1}\}$, where addition of indices is mod n. We now show that at most k of these intervals can be part of an intersecting family.
To see this, let us assume, without loss of generality, that $I_1 = \{x_1, \dots, x_k\}$ belongs to the family. Also, for each j let us define $I_j^-$ to be the interval $\{x_{j-k}, \dots, x_{j-1}\}$. Then every interval that belongs to the family must intersect $I_1$, which means that it must be either $I_j$ for some $j \in \{1, 2, \dots, k\}$ or $I_j^-$ for some $j \in \{2, 3, \dots, k\}$. (I did not mention $I_{k+1}^-$ because it is equal to $I_1$, but this is not an important point.) Also, we cannot include both $I_j$ and $I_j^-$, since they are disjoint (because k ≤ n/2). It follows, as claimed, that at most k intervals can belong to an intersecting family.
For each j, the probability that $I_j$ belongs to the family is $|\mathcal{A}| / \binom{n}{k}$, since $I_j$ is uniformly distributed in $[n]^{(k)}$. So the expected number of intervals that belong to the family is $n|\mathcal{A}| / \binom{n}{k}$. Therefore, $n|\mathcal{A}| / \binom{n}{k} \le k$, and therefore $|\mathcal{A}| \le \frac{k}{n}\binom{n}{k} = \binom{n-1}{k-1}$.
Note that if equality holds, then for every cyclic ordering there must be k consecutive intervals in $\mathcal{A}$. (Here we are using the fact that if P[X ≤ EX] = 1, then P[X = EX] = 1.) But if exactly k intervals belong to an intersecting family, then we can say slightly more than we said above. For 1 < j < k, it is not possible for $I_j^-$ and $I_{j+1}$ both to belong to $\mathcal{A}$ (here we use the fact that n ≥ 2k + 1), but for each $1 < j \le k$ we have to choose either $I_j^-$ or $I_j$ in order to have enough intervals. Therefore, if ever we choose $I_j^-$, then we must choose $I_{j+1}^-$, which forces us to choose $I_{j+2}^-$, and so on all the way up to $I_k^-$. It follows that there is some $1 \le r \le k$ such that the intervals we choose are $I_1, I_2, \dots, I_r, I_{r+1}^-, \dots, I_k^-$, or in other words all intervals of length k that contain $x_r$.
This observation enables us to show uniqueness of the extremal example. Let $x_1, \dots, x_{2k-1}$ be such that the intervals $\{x_j, \dots, x_{j+k-1}\}$ with $1 \le j \le k$ all belong to $\mathcal{A}$, and let u be an element not equal to any of $x_1, \dots, x_{2k-1}$ (which exists since 2k ≤ n). Now let A be any set of size k that contains $x_k$, and enumerate its elements as $\{y_1, \dots, y_k\}$ in such a way that $y_1, \dots, y_r$ belong to the set $\{x_1, \dots, x_k\}$, $y_r = x_k$, and $y_{r+1}, \dots, y_k$ do not. Now construct a cyclic order of [n] that begins with u, then continues with an ordering of $x_1, \dots, x_k$ that finishes with $y_1, \dots, y_r$, and continues after that with $y_{r+1}, \dots, y_k$ (and then finishes in some arbitrary way). Write this ordering as $z_0, z_1, \dots, z_{n-1}$ with $z_0 = u$. Then the interval $\{z_0, z_1, \dots, z_{k-1}\}$ is equal to the set $\{u, x_1, \dots, x_{k-1}\}$, which does not belong to $\mathcal{A}$ because it does not intersect $\{x_k, \dots, x_{2k-1}\}$. Since there must be k consecutive intervals in this cyclic order that belong to $\mathcal{A}$ – all the intervals of length k containing some fixed element – and since $\{z_1, \dots, z_k\} = \{x_1, \dots, x_k\}$ belongs to $\mathcal{A}$ while $\{z_0, \dots, z_{k-1}\}$ does not, that fixed element must be $z_k$. It follows that A, which equals $\{z_{k-r+1}, \dots, z_{2k-r}\}$ and contains $z_k$, belongs to $\mathcal{A}$.
Therefore, $\mathcal{A}$ contains every set of size k that contains the element $x_k$. Since this is a maximal intersecting family and has size $\binom{n-1}{k-1}$, we are done.
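As with Sperner's theorem, the bound is small enough to verify by brute force in the first interesting case. The following Python sketch checks, for n = 5 and k = 2, that every intersecting family of 2-sets has size at most $\binom{4}{1} = 4$, and that the extremal families are exactly the "stars" described in the theorem:

    from itertools import combinations
    from math import comb

    n, k = 5, 2  # k < n/2
    ksets = [frozenset(c) for c in combinations(range(1, n + 1), k)]

    extremal = []
    for mask in range(1 << len(ksets)):   # all families of k-sets
        family = [s for i, s in enumerate(ksets) if mask >> i & 1]
        if all(a & b for a in family for b in family):   # intersecting family
            assert len(family) <= comb(n - 1, k - 1)     # Erdos-Ko-Rado bound
            if len(family) == comb(n - 1, k - 1):
                extremal.append(family)

    # every extremal family is a "star": all k-sets containing a fixed element
    print(all(frozenset.intersection(*f) for f in extremal))   # True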

1.3 The Szemerédi-Trotter theorem


Let x1 , . . . , xn be points in the plane and let L1 , . . . , Lm be lines. An incidence is a pair
(xi , Lj ) such that xi ∈ Lj . If all the points and all the lines are distinct, then how many
incidences can there be? This question is answered, up to a multiplicative constant, by a
beautiful theorem of Szemerédi and Trotter. Again, the proof we give is not the original
one, but a much simpler argument, due to György Elekes, which came as a major surprise
and led to significant progress in many questions in combinatorial geometry.
The starting point is another question, which is interesting in its own right. Let G be a graph. The crossing number of G is defined to be the smallest number of crossings (that
is, pairs of edges that intersect) in any drawing of G in the plane. So a planar graph is
a graph that has crossing number 0, and the larger the crossing number, the further the
graph is from being planar.
Let us now investigate the connection between the crossing number of a graph and how
many edges it has.

Lemma 1.8. A planar graph with n vertices has at most 3n − 6 edges.

Proof. Let e and f be the number of edges and faces of G, respectively. Euler’s formula
tells us that n − e + f = 2. We also saw in Example 1.4 that e ≥ 3f /2. It follows that
2 = n − e + f ≤ n − e + 2e/3 = n − e/3. Rearranging gives the lemma.

Corollary 1.9. A graph with n vertices and m edges has crossing number at least m − 3n.

Proof. Let G be such a graph and suppose we have drawn it in the plane. By Lemma 1.8,
there is at least one crossing. Remove one of the two crossing edges and the number of
edges is still greater than 3n − 6, so there is another crossing. We can keep doing that
m − 3n times with the number of edges remaining greater than 3n − 6, so there must be
at least m − 3n crossings in the original drawing of G.
Now comes the averaging argument, which for large m will greatly boost the bound
just proved.

Corollary 1.10. Let G be a graph drawn in the plane with n vertices and m edges with m ≥ 6n. Then there are at least $m^3/72n^2$ crossings.

Proof. Let p be a probability to be chosen later, and let H be an induced subgraph of G chosen by picking each vertex of G independently at random with probability p.
Suppose that the edges xy and zw cross in G. Then the probability that this crossing belongs to H as well is $p^4$. Therefore, if t is the number of crossings in G, then the expected number of crossings in H is $p^4 t$.
But the expected number of edges in H is $p^2 m$ and the expected number of vertices is pn. Therefore, by Corollary 1.9 the expected number of crossings in H is at least $p^2 m - 3pn$. It follows that $p^4 t \ge p^2 m - 3pn$.
It remains merely to make a good choice of p. We choose it so that $p^2 m = 6pn$: that is, we take p to be 6n/m. That gives us that $p^4 t \ge 3pn$, so $t \ge 3n/p^3 = m^3/72n^2$.
It is not immediately obvious what crossing numbers have to do with incidences. How-
ever, as we shall see, given a system of points and lines, one can define a graph drawing in a
natural way, and the crossing number estimate turns out to give us exactly the information
we are looking for.

Theorem 1.11 (Szemerédi, Trotter). Given n distinct points and m distinct lines in the plane, the number of incidences is at most $8\max\{m, n, m^{2/3}n^{2/3}\}$.

Proof. Regard the system of points and lines as a drawing of a graph as follows. The
points are the vertices. For each line L, enumerate the vertices along that line in order as
x1 , . . . , xk , and for each i < k regard the segment of L from xi to xi+1 as an edge.
Let the number of incidences be s. Let the lines be $L_1, \dots, L_m$ and for each i let the number of points on line $L_i$ be $s_i$. Then $s_1 + \dots + s_m = s$, and the number of edges of the graph is $(s_1 - 1) + \dots + (s_m - 1) = s - m$. Therefore, the number of crossings is, by Corollary 1.10, at least $(s-m)^3/72n^2$ if $s - m \ge 6n$.
However, since two lines cross in at most one point, we also know that the number of crossings is at most $\binom{m}{2} \le m^2/2$. So if $s \ge 6n + m$, then $(s-m)^3/72n^2 \le m^2/2$, which implies that $(s-m)^3 \le 36m^2n^2$. From this it follows that either $s \le 6n + 2m \le \max\{8m, 8n\}$ or $(s/2)^3 \le 36m^2n^2$. In the second case, $s \le 8m^{2/3}n^{2/3}$. The result follows.

Three examples help to make sense of the bound in the theorem. If there is just one line and it contains all the points, then the number of incidences is n. Similarly, if there is just one point and it is contained in all the lines, then the number of incidences is m. And if we take the grid $\{0, 1, \dots, r-1\} \times \{0, 1, \dots, 2r^2 - 1\}$ and all lines that pass through one of the points (0, y) with $y < r^2$ and have slope belonging to the set $\{1, 2, \dots, r\}$, then there are $2r^3$ points and $r^3$ lines, and each line passes through r points, which gives $r^4$ incidences, equalling the Szemerédi-Trotter bound up to a constant. With a bit more thought one can generalize these examples and show that the bound is sharp up to a constant for all values of m and n.
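The grid example can be checked by direct computation. The following Python sketch (with r = 3) builds the grid and the lines described above and counts incidences, confirming the values $2r^3$, $r^3$ and $r^4$:

    r = 3
    points = {(x, y) for x in range(r) for y in range(2 * r * r)}
    # lines y = m*x + c through (0, c) with c < r^2 and slope m in {1, ..., r}
    lines = [(m, c) for c in range(r * r) for m in range(1, r + 1)]

    incidences = sum((x, m * x + c) in points
                     for (m, c) in lines for x in range(r))
    print(len(points), len(lines), incidences)   # 54 27 81, i.e. 2r^3, r^3, r^4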

2 A few bounds on binomial coefficients


Combinatorics often involves estimates and those estimates often involve factorials and
binomial coefficients, so it is helpful to have a few rough and ready bounds for them.
For factorials one can use Stirling’s formula, but for most applications that level of
accuracy isn’t needed, so I prefer the following bounds, which are much easier to prove.

Lemma 2.1. For every n, $(n/e)^n \le n! \le e(n+1)(n/e)^n$.


Proof. Suppose you want to obtain crude bounds for $\int_1^n \log x \, dx$. If you decompose the interval [1, n] into intervals of width 1 and look at the upper and lower Riemann sums, you will deduce that
\[ \sum_{m=1}^{n-1} \log m \le \int_1^n \log x \, dx \le \sum_{m=1}^{n-1} \log(m+1). \]
But the antiderivative of log x is x log x − x, so the expression in the middle is equal to n log n − n + 1. Taking exp of both sides, we deduce that
\[ (n-1)! \le e^{n\log n - n + 1} \le n! \]
for every n.
Now $e^{n\log n - n + 1} = e(n/e)^n$, from which we obtain the lower bound. Also, $(n+1)^n = (1 + 1/n)^n n^n \le e n^n$, so
\[ e((n+1)/e)^{n+1} = e^{-n}(n+1)^n(n+1) \le e(n+1)(n/e)^n, \]
which gives us the upper bound.


This tells us in particular that $(n/e)^n$ is a rough and ready approximation to n! that is useful in contexts where we are not concerned by a stray multiple of n or so.
For some purposes this is not enough. For example, it is often useful to have a good estimate for $2^{-n}\binom{n}{n/2}$ (when n is even), and if we simply plug in the bounds for factorials then we get a trivial upper bound of 1 and a fairly weak lower bound. But we can use a completely elementary argument to obtain a bound that is correct up to a constant multiple.

Lemma 2.2. For every even integer n, $\frac{1}{\sqrt{2n}} \le 2^{-n}\binom{n}{n/2} \le \frac{1}{\sqrt{n}}$.

Proof. Let us begin by writing an exact formula for the binomial coefficient in question. We have
\[ 2^{-n}\binom{n}{n/2} = \frac{n!}{2^n (n/2)!\,(n/2)!} = \frac{1 \cdot 2 \cdot 3 \cdots n}{2 \cdot 4 \cdot 6 \cdots n \cdot 2 \cdot 4 \cdot 6 \cdots n} = \frac{1 \cdot 3 \cdot 5 \cdots (n-1)}{2 \cdot 4 \cdot 6 \cdots n} = \prod_{m=1}^{n/2} \Big(1 - \frac{1}{2m}\Big). \]
To obtain an upper bound for this, let us observe that
\[ \Big(\frac{1 \cdot 3 \cdot 5 \cdots (n-1)}{2 \cdot 4 \cdot 6 \cdots n}\Big)^2 \le \frac{1 \cdot 2 \cdot 3 \cdot 4 \cdots n}{2 \cdot 3 \cdot 4 \cdot 5 \cdots (n+1)} = \frac{1}{n+1}. \]
Similarly, a lower bound is given by
\[ \Big(\frac{1 \cdot 3 \cdot 5 \cdots (n-1)}{2 \cdot 4 \cdot 6 \cdots n}\Big)^2 \ge \frac{1 \cdot 1 \cdot 2 \cdot 3 \cdots (n-1)}{2 \cdot 2 \cdot 3 \cdot 4 \cdots n} = \frac{1}{2n}. \]
Taking square roots gives the result.
Remark 2.3. The quantity $2^{-n}\binom{n}{n/2}$ is the probability that an n-step random walk ends up at the origin. The end point of such a random walk is a sum $X = X_1 + \dots + X_n$ where the $X_i$ are independent random variables that take the values ±1, each with probability 1/2. Since variances of independent random variables add, Var X = n, so X has standard deviation $\sqrt{n}$. We expect the values near zero to be taken with roughly equal probability (as long as they are even), so it is no surprise that the probability of landing exactly on zero should be of order $n^{-1/2}$.
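Lemma 2.2 is easy to check numerically; the following two-line Python test confirms the bounds for all even n up to 200:

    from math import comb, sqrt

    for n in range(2, 201, 2):   # even n only
        p = comb(n, n // 2) / 2 ** n
        assert 1 / sqrt(2 * n) <= p <= 1 / sqrt(n)
    print("Lemma 2.2 verified for all even n up to 200")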

Another bound that is often useful when m is a lot smaller than n is the trivial bound
\[ \binom{n}{m} = \frac{n(n-1)\cdots(n-m+1)}{m!} \le n^m. \]
Sometimes one wants to be just a little more accurate than this, in which case we can use the fact that $m! \ge (m/e)^m$ to improve this upper bound to $(en/m)^m$. Note that that is true whether or not m is small, and indeed it can be useful even when $m = \alpha n$ for a constant α, as long as that constant is not too close to 1/2 (because then it does not improve on the trivial bound of $2^n$).
In a later section we shall give a more accurate bound for $\binom{n}{m}$ when $m = \alpha n$ for a constant α independent of n.
Next, it is surprisingly useful to think about ratios of consecutive binomial coefficients. Directly from the formula one has that
\[ \binom{n}{m+1} \Big/ \binom{n}{m} = \frac{n-m}{m+1}. \]
In particular, if m = αn then the ratio is roughly (1 − α)/α. From this it follows if α < 1/2 that
\[ \binom{n}{0} + \binom{n}{1} + \dots + \binom{n}{m} \le \binom{n}{m} \cdot \frac{1}{1 - \alpha/(1-\alpha)} = \binom{n}{m} \cdot \frac{1-\alpha}{1-2\alpha}. \]
The important point here is not the precise estimate, but the fact that if α is bounded away from 1/2, then to within a constant the sum of all the binomial coefficients up to $\binom{n}{m}$ is the same as $\binom{n}{m}$ itself. Often we don't care about constant factors, so then we can say that the sum of the binomial coefficients up to $\binom{n}{m}$ is basically $\binom{n}{m}$.
To give an idea of what one can do with ratios, we now prove a lemma that shows that
almost all of the discrete cube is concentrated around the middle layers. This is a simple
example of the concentration of measure phenomenon, which we shall see more about later
in the course.
Lemma 2.4. Let $m = (1/2 - \theta)n$ with $0 < \theta \le 1/2$. Then $2^{-n}\binom{n}{m} \le e^{-\theta^2 n/2}$.


Proof. If $k \le (1/2 - \theta/2)n$, then
\[ \binom{n}{k-1} \Big/ \binom{n}{k} = \frac{k}{n-k+1} \le \frac{1/2 - \theta/2}{1/2 + \theta/2} \le 1 - \theta. \]
It follows that
\[ \binom{n}{(1/2-\theta)n} \le (1-\theta)^{\theta n/2} \binom{n}{(1/2-\theta/2)n} \le e^{-\theta^2 n/2} \cdot 2^n, \]
which proves the result.
Remark 2.5. I was not careful about whether (1/2 − θ/2)n was an integer, so the proof
above is strictly speaking not quite correct. However, it is not worth the bother of correcting
it, since we are about to see a different proof that does not run into this kind of difficulty.
(But if you do want to correct it, you could note that the proof works in the case that m
and n/2 − m are both even integers, and we threw quite a lot away, so getting it for the
coefficients in between is not too hard.)

Next, we prove a rather more general estimate. It can itself be considerably generalized,
but even in this form it is useful.

Lemma 2.6. Let $X_1, \dots, X_n$ be independent random variables of mean zero taking values in [−1, 1], let $0 < \epsilon \le 1$, and let $X = \sum_i X_i$. Then $P[X \ge \epsilon n] \le e^{-\epsilon^2 n/4}$.

Proof. We use a trick that can be used to prove many results of this flavour, which is to look at the exponential moment $E e^{\lambda X}$, apply Markov's inequality, and optimize over λ. We have
\[ E e^{\lambda X} = E e^{\lambda \sum_i X_i} = E \prod_i e^{\lambda X_i} = \prod_i E e^{\lambda X_i}. \]
Now since $E X_i = 0$ for each i,
\[ E e^{\lambda X_i} = 1 + \frac{\lambda^2}{2!} E X_i^2 + \frac{\lambda^3}{3!} E X_i^3 + \dots. \]
If λ ≤ 1, then since $|X_i|$ is always at most 1, the right-hand side is at most $1 + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \dots \le e^{\lambda^2}$. So $E e^{\lambda X} \le e^{\lambda^2 n}$.
By Markov's inequality,
\[ P[X \ge \epsilon n] = P[e^{\lambda X} \ge e^{\lambda \epsilon n}] \le e^{\lambda^2 n - \lambda \epsilon n}. \]
Choosing $\lambda = \epsilon/2$ we obtain an upper bound of $e^{-\epsilon^2 n/4}$.

Remark 2.7. The constant 1/4 in the bound above can be improved, but at the cost of
making the argument a bit more complicated.
Remark 2.8. Replacing each $X_i$ by $-X_i$, we find that $P[X \le -\epsilon n] \le e^{-\epsilon^2 n/4}$ as well.

Corollary 2.9. Let $\epsilon > 0$, let n be a positive integer, and let $m = (1/2 - \epsilon)n$. Then $2^{-n}\sum_{k=0}^{m}\binom{n}{k} \le e^{-\epsilon^2 n}$.

Proof. Let $X_1, \dots, X_n$ be independent random variables that take the values ±1, each with probability 1/2, and let $X = \sum_i X_i$. Then the quantity we are trying to bound is $P[X \le -2\epsilon n]$. By Remark 2.8, this is at most $e^{-(2\epsilon)^2 n/4} = e^{-\epsilon^2 n}$.
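For ±1-valued variables the tail probability in Lemma 2.6 can be computed exactly from binomial coefficients, which gives a feel for how much room the constant 1/4 leaves. A small Python comparison (with the arbitrary parameters n = 100, ε = 0.3):

    from math import ceil, comb, exp

    n, eps = 100, 0.3
    # X = sum of n independent +-1 signs equals 2H - n with H ~ Binomial(n, 1/2),
    # so X >= eps*n exactly when H >= (1 + eps) * n / 2
    tail = sum(comb(n, h) for h in range(ceil((1 + eps) * n / 2), n + 1)) / 2 ** n
    print(tail, exp(-eps ** 2 * n / 4))   # ~0.0018 versus the bound ~0.105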

3 Well-separated sets and vectors
Variants of the following question come up frequently when one works on combinatorial problems. Let $A_1, \dots, A_N$ be subsets of [n] of size n/2 and suppose that $|A_i \cap A_j| \le \alpha n$ for every $i \ne j$. How large can N be?
The answer turns out to depend quite heavily on α, and in particular on how α compares with 1/4, the significance of which is that the expected size of the intersection of two random sets of size n/2 is n/4.

Theorem 3.1. Let α > 1/4. Then there can be exponentially many subsets of [n] of size
n/2 such that no two intersect in more than αn.

Proof. Let A be a random set of size n/2. We begin by estimating the probability that $|A \cap \{1, 2, \dots, n/2\}| > \alpha n$. Writing m = αn, we can give a formula for this probability, namely
\[ \binom{n}{n/2}^{-1} \sum_{r=m+1}^{n/2} \binom{n/2}{r}\binom{n/2}{n/2-r} \le 2^{n/2} \binom{n}{n/2}^{-1} \sum_{r=0}^{n/2-m-1} \binom{n/2}{r}, \]
where the inequality follows by replacing r with n/2 − r, using the fact that $\binom{n}{k} = \binom{n}{n-k}$ for any n and k, and bounding one of the two binomial coefficients by $2^{n/2}$.
Now n/2 − m − 1 = βn/2 for a constant β that is less than 1/2. Therefore, by Corollary 2.9, each binomial coefficient $\binom{n/2}{r}$ in the above sum is exponentially small compared with $2^{n/2}$, which implies that the entire sum is an exponentially small fraction of $2^n$, and therefore the entire probability is exponentially small. In principle we could obtain an explicit bound for the exponent, but we content ourselves here with stating that there is a constant c that depends on α (it will be $(\alpha - 1/4)^2$ up to a constant) such that the probability is at most $e^{-cn}$.
By symmetry, we now know that if we pick sets $A_1, \dots, A_N$ of size n/2 independently at random, then for each i < j the probability that $|A_i \cap A_j| > \alpha n$ is at most $e^{-cn}$. Therefore, the expected number of bad pairs i < j is at most $e^{-cn}\binom{N}{2} \le e^{-cn}N^2/2$.
It follows that with non-zero probability the number of bad pairs is at most $e^{-cn}N^2/2$. (Strictly speaking, we are using Proposition 1.1 here.) Therefore, there exist sets $A_1, \dots, A_N$ such that the number of bad pairs is at most $e^{-cn}N^2/2$. If we choose N such that this is at most N/2, we can remove one set from each bad pair and still be left with N/2 sets, and now there are no bad pairs. Taking $N = e^{cn}$ does the job.
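The concentration that drives this proof is easy to observe empirically: the intersection of two random n/2-subsets of [n] has size close to n/4, and intersections larger than αn essentially never occur once n is reasonably large. A small Python simulation (with arbitrary parameter choices):

    import random

    n, alpha, trials = 1000, 0.3, 2000
    bad = 0
    for _ in range(trials):
        a = set(random.sample(range(n), n // 2))
        b = set(random.sample(range(n), n // 2))
        if len(a & b) > alpha * n:      # a "bad pair" in the sense of the proof
            bad += 1
    print(bad, "bad pairs out of", trials)   # prints 0: |a & b| concentrates near n/4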

3.1 Associating vectors with sets


We now turn to the cases α = 1/4 and α < 1/4. For these we shall use an extremely
important idea with applications all over combinatorics, which is to treat sets as 01-valued
functions and then to view those functions as living in a Euclidean (or occasionally a more
general) space. Given a set A, we shall write 1A for its characteristic function – that is,
the function that takes the value 1 in A and 0 outside A. An immediate observation we can make is that if A and B are subsets of [n], then
\[ |A \cap B| = \sum_{i=1}^n 1_A(i) 1_B(i) = \langle 1_A, 1_B \rangle. \]

However, it is often convenient to associate with a set not its characteristic function but its balanced function, which is defined as follows. If A is a subset of a finite set X and |A| = δ|X|, then for each $x \in X$ we define $f_A(x)$ to be 1 − δ if $x \in A$ and −δ if $x \notin A$. We call this function balanced because it averages zero: indeed, $\mathbb{E}_x f_A(x) = (1-\delta)\delta - \delta(1-\delta) = 0$.
Lemma 3.2. Let A and B be two subsets of [n] of size n/2. Let $f_A$ and $f_B$ be the balanced functions of A and B. Then $\langle f_A, f_B \rangle = |A \cap B| - n/4$.

Proof.
\[ \langle f_A, f_B \rangle = \langle 1_A - 1/2, 1_B - 1/2 \rangle = |A \cap B| - |A|/2 - |B|/2 + n/4 = |A \cap B| - n/4. \]

In particular, if we have sets A and B of size n/2 and their intersection has size at most (1/4 − δ)n, then $\langle f_A, f_B \rangle \le -\delta n$. This enables us to prove the following rather surprising result, which is that if α < 1/4, then the largest possible number of sets of size n/2 that intersect in at most αn is bounded independently of n.

Theorem 3.3. Let $A_1, \dots, A_m$ be subsets of [n] of size n/2 and suppose that $|A_i \cap A_j| \le (1/4 - \delta)n$ for every $i \ne j$. Then m ≤ 1 + 1/4δ.
Proof. Write $f_i$ for the balanced function $f_{A_i}$. Then, writing $\|\cdot\|$ for the Euclidean norm (that is, $\|f\|^2 = \langle f, f \rangle$),
\[ 0 \le \Big\| \sum_{i=1}^m f_i \Big\|^2 = \sum_{i,j} \langle f_i, f_j \rangle = \sum_i \|f_i\|^2 + \sum_{i \ne j} \langle f_i, f_j \rangle \le mn/4 - m(m-1)\delta n, \]
where for the last inequality we used Lemma 3.2. The result follows on rearranging.
Of course, this is just (equivalent to) a special case of a more general theorem, which
can be proved in the same way.
Theorem 3.4. Let $v_1, \dots, v_m$ be unit vectors in a Euclidean space, let δ > 0, and suppose that $\langle v_i, v_j \rangle \le -\delta$ for every $i \ne j$. Then m ≤ 1 + 1/δ.

Proof.
\[ 0 \le \Big\| \sum_i v_i \Big\|^2 = m + \sum_{i \ne j} \langle v_i, v_j \rangle \le m - \delta m(m-1), \]
which proves the result.


If δ = 1/k for a positive integer k, then it is possible to find 1 + 1/δ = k + 1 unit vectors that all have inner products at most −δ (and in fact necessarily equal to −δ, by the proof above). One takes the vertices of a regular simplex of dimension k.
A nice way to construct such an object is to take the standard basis vectors $e_1, \dots, e_{k+1}$ in $\mathbb{R}^{k+1}$ and to project them to the subspace orthogonal to the vector (1, 1, . . . , 1), or in other words to the subspace of vectors whose coordinates add up to zero. This gives us vectors $v_1, \dots, v_{k+1}$, where
\[ v_i = \Big( -\frac{1}{k+1}, \dots, -\frac{1}{k+1},\ 1 - \frac{1}{k+1},\ -\frac{1}{k+1}, \dots, -\frac{1}{k+1} \Big), \]
where the $1 - \frac{1}{k+1}$ is in the ith place. If $i \ne j$, then
\[ \langle v_i, v_j \rangle = \frac{k-1}{(k+1)^2} - \frac{2}{k+1} + \frac{2}{(k+1)^2} = -\frac{1}{k+1}. \]
We also have that
\[ \|v_i\|^2 = \frac{k}{(k+1)^2} + 1 - \frac{2}{k+1} + \frac{1}{(k+1)^2} = 1 - \frac{1}{k+1} = \frac{k}{k+1}. \]
Therefore, if we normalize the vectors by setting $u_i = ((k+1)/k)^{1/2} v_i$, we have k + 1 unit vectors and $\langle u_i, u_j \rangle = -1/k$ when $i \ne j$.
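These computations are easy to confirm numerically. A quick check with numpy (for an arbitrary k):

    import numpy as np

    k = 6
    # project the standard basis of R^(k+1) onto the hyperplane of
    # coordinate-sum zero, then rescale by ((k+1)/k)^(1/2) as above
    v = np.eye(k + 1) - 1.0 / (k + 1)
    u = ((k + 1) / k) ** 0.5 * v
    gram = u @ u.T

    print(np.allclose(np.diag(gram), 1.0))            # unit vectors
    off_diag = gram[~np.eye(k + 1, dtype=bool)]
    print(np.allclose(off_diag, -1.0 / k))            # inner products -1/k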
Now let us turn to the question of how many unit vectors we can find in $\mathbb{R}^n$ if their inner products are all at most zero. (This corresponds to the case of sets of size n/2 with intersections of size at most n/4.) A simple example of such a collection of unit vectors is $\pm u_1, \dots, \pm u_n$, where $u_1, \dots, u_n$ is an orthonormal basis. This turns out to be best possible, as we now show. (For this particular question it is of course unimportant that the vectors are unit vectors – we just need them not to be the zero vector.)

Theorem 3.5. Let $x_1, \dots, x_m$ be non-zero vectors in $\mathbb{R}^n$ such that $\langle x_i, x_j \rangle \le 0$ for every $i \ne j$. Then m ≤ 2n, and if m = 2n then there is an orthonormal basis $a_1, \dots, a_n$ such that each $x_i$ is a multiple of some $a_j$ (which implies that amongst the $x_i$ there is exactly one positive multiple and one negative multiple of each $a_j$).

Proof. We prove this by induction on n. The case n = 1 is trivial. Assume now that n > 1 and that m ≥ 2n. Let $a_1 = x_1/\|x_1\|$ and let $y_2, \dots, y_m$ be the orthogonal projections of $x_2, \dots, x_m$ to the subspace orthogonal to $x_1$. Then for each i ≥ 2, $y_i = x_i - \langle x_i, a_1 \rangle a_1$. If $i \ne j$, then
\[ \langle y_i, y_j \rangle = \langle x_i, x_j \rangle - \langle x_i, a_1 \rangle \langle x_j, a_1 \rangle. \]
Since neither of $\langle x_i, a_1 \rangle$ and $\langle x_j, a_1 \rangle$ is positive, this is at most $\langle x_i, x_j \rangle$, and therefore at most 0.
We therefore have m − 1 vectors $y_2, \dots, y_m$ in an (n−1)-dimensional space with $\langle y_i, y_j \rangle \le 0$ for every $i \ne j$. Since m − 1 > 2(n − 1), it follows from our inductive hypothesis that $y_2, \dots, y_m$ are not all non-zero. Without loss of generality $y_2 = 0$, which implies that $x_2$ is a multiple of $a_1$, and therefore a negative multiple of $a_1$.
But if $x_1$ is a positive multiple of $a_1$ and $x_2$ is a negative multiple of $a_1$, then the only way another vector can avoid having a positive inner product with one of $x_1$ and $x_2$ is if it is orthogonal to $a_1$. Therefore, $x_3, \dots, x_m$ are at least 2(n − 1) vectors living in the space orthogonal to $a_1$, which is (n − 1)-dimensional. By induction we can find an orthonormal basis $a_2, \dots, a_n$ of this space with the properties stated in the theorem, and combining this with $a_1$ creates an orthonormal basis that proves the theorem.
Combining this result with Lemma 3.2 and observing that the balanced function of a
subset of [n] of size n/2 lives in the (n−1)-dimensional subspace orthogonal to (1, 1, . . . , 1),
we deduce immediately that it is not possible to find more than 2(n − 1) sets of size n/2
that intersect in at most n/4. But can we find that many? To do so, we need to find a large
set of orthogonal vectors with ±1 coordinates. (The balanced functions have coordinates
equal to ±1/2, but that is clearly equivalent.)

3.2 Hadamard matrices


Here are two constructions that give such sets. The first is a very simple inductive one. Let $W_0$ be the 1 × 1 matrix (1) and in general define
\[ W_{m+1} = \begin{pmatrix} W_m & W_m \\ W_m & -W_m \end{pmatrix}. \]

It is straightforward to check the following facts.

• $W_m$ is a $2^m \times 2^m$ matrix with ±1 entries.

• The first row of $W_m$ consists entirely of 1s.

• The rows of $W_m$ are orthogonal to each other.


The matrices $W_m$ are called Walsh matrices. If $n = 2^m$, we have an n × n ±1 matrix W with first row consisting of 1s. If we now take $A_i$ to be $\{j : W_{ij} = 1\}$ and $B_i = \{j : W_{ij} = -1\}$ for i = 2, 3, . . . , n, we obtain 2(n − 1) sets $A_2, \dots, A_n, B_2, \dots, B_n$ of size n/2 such that $A_i \cap B_i = \emptyset$ for each i and otherwise any two distinct sets from the collection have intersection of size n/4. (The fact that the sets have size n/2 is equivalent to the fact that the later rows of W are orthogonal to the first row.)
The second construction is algebraic. Recall that if 0 ≤ x < p, then the Legendre symbol $\left(\frac{x}{p}\right)$ stands for 1 if x is a quadratic residue, −1 if x is a non-residue, and 0 if x = 0. Let p be a prime of the form 4m + 3, let n = p + 1, and define the Paley matrix $P = P_n$ by setting $P_{xy} = \left(\frac{x-y}{p}\right)$ if 0 ≤ x, y ≤ p − 1 and $x \ne y$, −1 if x = y, and 1 if either of x, y is equal to p. In other words, we first create a p × p matrix by setting $P_{xy}$ to be 1 or −1 according to whether x − y is or is not a quadratic residue, considering 0 to be a non-residue for this purpose, and then we attach to it an extra row and column that consist of 1s.
Here is the Paley matrix $P_8$. Note that if you ignore the last row and column, then it is constant down diagonals (or even "mod-7 diagonals").
\[ \begin{pmatrix}
-1 & 1 & 1 & -1 & 1 & -1 & -1 & 1 \\
-1 & -1 & 1 & 1 & -1 & 1 & -1 & 1 \\
-1 & -1 & -1 & 1 & 1 & -1 & 1 & 1 \\
1 & -1 & -1 & -1 & 1 & 1 & -1 & 1 \\
-1 & 1 & -1 & -1 & -1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & -1 & -1 & 1 & 1 \\
1 & 1 & -1 & 1 & -1 & -1 & -1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix} \]

The proof that the Paley matrix is orthogonal uses some elementary number theory in a very pleasant way.

Lemma 3.6. For every prime p and every $d \not\equiv 0$ mod p we have that $\sum_x \left(\frac{x}{p}\right)\left(\frac{x+d}{p}\right) = -1$, where the summation is over all x mod p.

Proof. We shall make use of the fact that the function $x \mapsto \left(\frac{x}{p}\right)$ is multiplicative.
\[ \sum_x \left(\frac{x}{p}\right)\left(\frac{x+d}{p}\right) = \sum_{x \ne 0, -d} \left(\frac{x^2}{p}\right)\left(\frac{1 + dx^{-1}}{p}\right) = \sum_{x \ne 0, -d} \left(\frac{1 + dx^{-1}}{p}\right). \]
Now as x ranges over all values other than 0 or −d, $1 + dx^{-1}$ ranges over all values other than 1 or 0. But half the non-zero elements mod p are quadratic residues and 1 is a quadratic residue, so this last sum is equal to −1.

Corollary 3.7. The Paley matrix has orthogonal rows.

Proof. The inner product of two distinct rows of the Paley matrix is closely related to the sum calculated in Lemma 3.6. If neither row is the last, then it is of the form
\[ \sum_{x \ne 0, -d} \left(\frac{x}{p}\right)\left(\frac{x+d}{p}\right) - \left(\frac{d}{p}\right) - \left(\frac{-d}{p}\right) + 1, \]
since when x = 0 we replace $\left(\frac{x}{p}\right)$ by −1, and the product of the last entries in the two rows is 1. But p ≡ 3 mod 4, so −1 is not a quadratic residue mod p, from which it follows that $\left(\frac{d}{p}\right) = -\left(\frac{-d}{p}\right)$. Therefore, by Lemma 3.6 (and the fact that $\left(\frac{0}{p}\right) = 0$), the inner product is −1 + 1 = 0.
If one of the rows is the last row, then we obtain the sum
\[ -1 + \sum_{x \ne 0} \left(\frac{x}{p}\right) + 1. \]
Since exactly half the non-zero numbers mod p are quadratic residues, this is also 0.
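The Paley construction can likewise be checked by machine. The sketch below builds $P_{p+1}$ for p = 7 directly from the definition, computing Legendre symbols via Euler's criterion ($x^{(p-1)/2} \equiv \left(\frac{x}{p}\right)$ mod p), and verifies that the rows are orthogonal:

    import numpy as np

    p = 7   # any prime congruent to 3 mod 4
    legendre = {x: (1 if pow(x, (p - 1) // 2, p) == 1 else -1)
                for x in range(1, p)}

    P = np.ones((p + 1, p + 1), dtype=int)   # last row and column are all 1s
    for x in range(p):
        for y in range(p):
            P[x, y] = -1 if x == y else legendre[(x - y) % p]

    # rows orthogonal, each of squared norm p + 1: a Hadamard matrix
    print(np.array_equal(P @ P.T, (p + 1) * np.eye(p + 1, dtype=int)))   # True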
A ±1 matrix with orthogonal rows is called a Hadamard matrix. The construction of Walsh matrices relies on the observation that if H is a Hadamard matrix, then so is $\begin{pmatrix} H & H \\ H & -H \end{pmatrix}$. From that and the construction of Paley matrices, we find that if n is of the form $2^m(p+1)$ for p a prime congruent to 3 mod 4, then there is an n × n Hadamard matrix. The following question is a famous open problem.

Question 3.8. Let n be a multiple of 4. Does there exist an n × n Hadamard matrix?

The smallest n that is a multiple of 4 for which no Hadamard matrix is known is 668.

4 The sum-product problem


Let A be a set of integers. The sumset A + A is defined to be {x + y : x, y ∈ A}. There
are many very interesting results about the sizes of sumsets and what they tell us about
the sets themselves. Here is a simple one.

Theorem 4.1. Let A be a set of size n. Then $|A + A| \ge 2n - 1$, with equality if and only if A is an arithmetic progression.

Proof. Let A = {a1 , . . . , an } with a1 < · · · < an . Then the 2n − 1 numbers

a1 + a1 , a1 + a2 , . . . , a1 + an , a2 + an , . . . , an + an

form a strictly increasing sequence, so they are all distinct. More generally, given any sequence
\[ (i_1, j_1), (i_2, j_2), \dots, (i_{2n-1}, j_{2n-1}) \]
such that $i_1 = j_1 = 1$, $i_{2n-1} = j_{2n-1} = n$, and each term in the sequence is obtained by adding 1 to either the x-coordinate or the y-coordinate of the previous term, all the resulting sums $a_{i_t} + a_{j_t}$ form a strictly increasing sequence.
It follows that if |A + A| = 2n − 1, then the increasing sequences of sums arising in this way are all the same. From that it follows that $a_i + a_j$ depends only on i + j. From this it follows that the numbers $a_1, \dots, a_n$ form an arithmetic progression, since $a_{i-1} + a_{i+1} = 2a_i$ for every i strictly between 1 and n.
We also define the product set A.A to be the set {xy : x, y ∈ A}. If A consists of positive
numbers, then by taking the logarithm of those numbers and applying what we have just
proved for sumsets (after checking that the proof goes through unchanged if a1 , . . . , an are
real numbers, which it does), we see that if |A| = n, then |A.A| ≥ 2n − 1, with equality if
and only if A is a geometric progression.
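These two extremes are easy to see numerically. The following Python sketch (with an arbitrary choice of progressions of length n = 500) computes the sizes of the sumset and product set of an arithmetic and a geometric progression:

    import operator

    n = 500
    ap = [1 + 3 * i for i in range(n)]    # arithmetic progression
    gp = [2 ** i for i in range(n)]       # geometric progression

    def distinct(A, op):
        return len({op(a, b) for a in A for b in A})

    print(distinct(ap, operator.add), distinct(ap, operator.mul))
    # 999 sums (= 2n - 1, minimal) but many more products
    print(distinct(gp, operator.add), distinct(gp, operator.mul))
    # many sums but only 999 products (= 2n - 1, minimal)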
So far, so simple, but we are a short step away from a famous unsolved problem of
Erdős and Szemerédi.
Conjecture 4.2. For every c > 0 there exists $n_0$ such that for every $n \ge n_0$ and every set A of positive integers of size n, either A + A or A.A has size at least $n^{2-c}$.

Note that the conjecture is saying that either the sumset or the product set is almost as big as it can be, since trivially the maximum size is at most n(n + 1)/2 (and simple examples show that this maximum can be attained – for example, to obtain a large sumset, take any sequence $a_1, a_2, \dots, a_n$ of positive integers such that $a_i \ge 2a_{i-1}$ for every i).
One reason the conjecture is quite plausible is that sets with small sumsets tend to be related to arithmetic progressions, which have very large product sets, whereas sets with small product sets tend to be related to geometric progressions, which have very large sumsets. However, this does not rule out some kind of strange "intermediate" set with, say, a sumset and a product set of size at most $n^{1.99}$, and indeed, almost nothing is known about the structure of sets with sumsets that are of this sort of size.
For some years, the best known bound for the problem, due to József Solymosi, was that, up to logarithmic factors, either the sumset or the product set has size at least $n^{4/3}$. The exponent 4/3 has subsequently been improved a few times, each time by just a tiny amount, and the current record, due to Misha Rudnev and Sophie Stevens, stands at 4/3 + 2/1167. But these improved bounds require complicated arguments, whereas the proof of Solymosi, which we give now, is beautifully simple.
We start with a useful definition and an almost trivial, but for the proof very important,
lemma.
Definition 4.3. Let A be a finite subset of $\mathbb{R}^+$. For each $x \in \mathbb{R}^+$, write $\rho_A^+(x)$ for the number of pairs $(a, b) \in A^2$ such that a + b = x, $\rho_A^\times(x)$ for the number of pairs $(a, b) \in A^2$ such that ab = x, and $\rho_A^\div(x)$ for the number of pairs $(a, b) \in A^2$ such that a/b = x. The additive energy of A is $\sum_{x \in A+A} \rho_A^+(x)^2$, and the multiplicative energy is $\sum_{x \in A.A} \rho_A^\times(x)^2$. The quotient set A/A of A is the set $\{x/y : x, y \in A\}$.

Lemma 4.4. The multiplicative energy is also equal to $\sum_{x \in A/A} \rho_A^\div(x)^2$.

Proof. The multiplicative energy $\sum_{x \in A.A} \rho_A^\times(x)^2$ is equal to the number of quadruples $(a, b, c, d) \in A^4$ such that ab = cd, which is the number of quadruples such that a/c = d/b, which is equal to $\sum_{x \in A/A} \rho_A^\div(x)^2$.
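The two ways of computing the multiplicative energy in Lemma 4.4 are easy to compare by machine. Here is a small Python check with an arbitrary example set; exact rational arithmetic is used for the quotients to avoid floating-point collisions:

    from collections import Counter
    from fractions import Fraction

    A = [1, 2, 3, 4, 6, 12]

    products = Counter(a * b for a in A for b in A)             # rho^x
    quotients = Counter(Fraction(a, b) for a in A for b in A)   # rho^/

    mult_energy = sum(c * c for c in products.values())
    print(mult_energy, sum(c * c for c in quotients.values()))  # equal, as in Lemma 4.4

    additive = Counter(a + b for a in A for b in A)             # rho^+
    print(sum(c * c for c in additive.values()))                # additive energy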
Solymosi deduces his sum-product theorem from a more precise statement, which is that for every set A of positive real numbers, $|A.A||A+A|^2 \ge |A|^4/2\lceil\log|A|\rceil$. This he proves by obtaining both an upper bound and a lower bound for the multiplicative energy of A. The lower bound is easier, so we give it first. We shall use the following inequality, which is immensely useful throughout combinatorics. Because it is so important, we give two proofs.

Lemma 4.5. Let $a_1, \dots, a_n$ be real numbers. Then $\sum_i |a_i|^2 \ge n^{-1}\big(\sum_i |a_i|\big)^2$.

Proof 1. By the Cauchy-Schwarz inequality,
\[ \sum_i |a_i| = \sum_i 1 \cdot |a_i| \le n^{1/2} \Big(\sum_i |a_i|^2\Big)^{1/2}. \]
The result follows on rearranging.

Proof 2. Let X be a random variable. Then its variance is
\[ E[X - EX]^2 = EX^2 - (EX)^2. \]
The left-hand side is clearly non-negative, so $(EX)^2 \le EX^2$. Now apply this fact to the random variable that takes the value $|a_i|$ with probability $n^{-1}$ (for each i) to deduce that $n^{-2}\big(\sum_i |a_i|\big)^2 \le n^{-1}\sum_i |a_i|^2$.

Lemma 4.6. Let A be a set of positive real numbers. Then the multiplicative energy of A is at least $|A|^4/|A.A|$.

Proof. The multiplicative energy is $\sum_{x \in A.A} \rho_A^\times(x)^2$. Now $\sum_x \rho_A^\times(x) = |A|^2$, since each pair $(a, b) \in A^2$ contributes 1 to exactly one $\rho_A^\times(x)$. Also, the number of x for which $\rho_A^\times(x) \ne 0$ is |A.A|. The result now follows from Lemma 4.5.
The above proof, though simple, is worth digesting fully, since this kind of use of the
Cauchy-Schwarz inequality has a huge number of applications.
We now turn to the upper bound, and for this we use Lemma 4.4, which tells us that we can give an upper bound for $\sum_{m \in A/A} \rho_A^\div(m)^2$ instead, which for convenience we shall rewrite as $\sum_{m \in A/A} \rho_A^\div(m^{-1})^2$, which we can do because $\rho_A^\div(m) = \rho_A^\div(m^{-1})$ for every $m \in A/A$.

Lemma 4.7. Let A be a set of positive real numbers. Then
\[ \sum_{m \in A/A} \rho_A^\div(m^{-1})^2 \le 2|A+A|^2 \lceil\log|A|\rceil. \]

Proof. We begin by applying another trick that is extremely useful in a number of contexts. It applies when we have a function defined on a finite set and have difficulty handling it because it is not sufficiently close to being constant. In such a situation, if the values are not too spread out and we are prepared to sacrifice a logarithmic factor, we can simply partition the domain according to the values taken, up to the nearest power of 2. This process is called dyadic decomposition.
Here every value of $\rho_A^\div(m^{-1})$ is between 1 and |A|, so if we divide the set {1, 2, . . . , |A|} up into intervals of the form $\{x : 2^{k-1} \le x < 2^k\}$, we need at most $\lceil\log_2|A|\rceil$ of those intervals. It follows by averaging that there exists one of the intervals, call it I, such that
\[ \sum_{\rho_A^\div(m^{-1}) \in I} \rho_A^\div(m^{-1})^2 \ge \frac{1}{\lceil\log_2|A|\rceil} \sum_{m \in A/A} \rho_A^\div(m^{-1})^2. \]
We shall now concentrate our efforts on finding an upper bound for the left-hand side.
Let us think a bit about what $\rho_A^\div(m^{-1})^2$ means. The number of $(a, b) \in A \times A$ such that $a/b = m^{-1}$ is the number of points in the intersection of A × A with the line y = mx. (Note that I am writing A × A for the Cartesian product of A with itself – it does not stand for the product set.) Therefore, the square of this number is the number of pairs of such points. That is not very interesting as it stands, but now that we know that the lines intersect A × A in sets of roughly the same size, we can make it more interesting by observing that up to a factor of 2 it is equal to the number of pairs of points that can be obtained by taking one point from one line and one point from another. And if the points in one line are $p_1, \dots, p_r$ and the points in another line are $q_1, \dots, q_s$, then because the lines have different gradients, the points $p_i + q_j$ are distinct and they all belong to (A × A) + (A × A) = (A + A) × (A + A). (Note that this × is a Cartesian product: A × A is not to be confused with A.A.)
Let $m_1, \dots, m_t$ be the elements m of A/A, in increasing order, for which $\rho_A^\div(m^{-1}) \in I$, and for each i let $L_i$ be the line $y = m_i x$ and let $B_i = L_i \cap (A \times A)$. Then the sets $B_i + B_{i+1}$ are disjoint (because $B_i + B_{i+1}$ lies between $L_i$ and $L_{i+1}$ and the gradients of the $L_i$ are in increasing order) and contained in (A + A) × (A + A). Also, for each i, $|B_i + B_{i+1}| \ge |B_i|^2/2$, since $|B_i + B_{i+1}| = |B_i||B_{i+1}|$ and all the $B_i$ have the same size up to a factor of 2.
It follows that $\sum_{i=1}^{t-1} |B_i|^2 \le 2|A+A|^2$. This is not quite what we want, because we would prefer to include $|B_t|^2$ on the left-hand side. There is a small trick that deals with this problem, which is to consider the vertical line that passes through the leftmost point of $B_t$, which is a point $(a, m_t a)$ for some $a \in A$. The intersection of this line with A × A contains all points $(a, m_t b)$ such that $(b, m_t b) \in B_t$. Setting $B_{t+1}$ to be this set of points, we then have that $|B_{t+1}| = |B_t|$, that $|B_{t+1} + B_t| = |B_t|^2$, and that $B_{t+1} + B_t$ is disjoint from the other sets $B_i + B_{i+1}$. We may therefore conclude that
\[ \sum_{\rho_A^\div(m^{-1}) \in I} \rho_A^\div(m^{-1})^2 = \sum_{i=1}^t |B_i|^2 \le 2|A+A|^2, \]
which implies the result, by our choice of I.

We have more or less finished the argument.

Theorem 4.8 (Solymosi). Let A be a set of positive real numbers. Then
\[ |A.A||A+A|^2 \ge \frac{|A|^4}{2\lceil\log|A|\rceil}, \]
and therefore either |A + A| or |A.A| is at least $|A|^{4/3}/(2\lceil\log|A|\rceil)^{1/3}$.

Proof. By our upper and lower bounds for the multiplicative energy, we find that
\[ 2|A+A|^2\lceil\log|A|\rceil \ge \frac{|A|^4}{|A.A|}. \]
The result follows.


If A = {1, 2, . . . , n}, then |A + A| ≤ 2n and $|A.A| \le n^2$, so $|A.A||A+A|^2 \le 4n^4$. Thus, the lower bound for $|A.A||A+A|^2$ given above is sharp up to the log factor. Of course, this does not imply that the lower bound for max{|A + A|, |A.A|} is sharp (and indeed, as mentioned earlier, it is known not to be).
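For concreteness, here is the computation for A = {1, . . . , n} in Python; the sumset has the minimum possible size 2n − 1, while the product set is far larger (for n = 1000 it has a few hundred thousand elements, a positive fraction of $n^2$ at this scale):

    n = 1000
    A = range(1, n + 1)
    sums = {a + b for a in A for b in A}
    prods = {a * b for a in A for b in A}
    print(len(sums), len(prods))   # 1999 sums; far more distinct products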

5 The chromatic number of the Kneser graph


The Kneser graph $G_{n,k}$ is defined as follows. Its vertex set is $[2n+k]^{(n)}$ – the set of all subsets of {1, 2, . . . , 2n + k} of size n – and two vertices are joined by an edge if and only if the corresponding subsets are disjoint. We normally think of k as fixed and n tending to infinity, so two sets of size n that are joined are almost complementary. However, this is not an assumption in the arguments that follow.
This section concerns the chromatic number of $G_{n,k}$, the determination of which was an open problem for over 20 years until it was solved by László Lovász in 1978. Kneser observed that the chromatic number is at most k + 2. Indeed, we can define a colouring as follows. For i = 1, 2, . . . , k + 1 we let $C_i$ consist of all sets of size n with minimal element i. Then all the remaining sets go into a colour class $C_{k+2}$. It is trivial that two sets in $C_i$ intersect if i ≤ k + 1. If they both belong to $C_{k+2}$, then they are n-element subsets of the set {k + 2, k + 3, . . . , 2n + k}, which has size 2n − 1, and therefore we see again that they intersect. By the definition of the Kneser graph, that implies that $C_1, \dots, C_{k+2}$ is indeed a proper colouring.
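Kneser's colouring can be verified mechanically for small parameters. The following Python sketch checks, for the (arbitrarily chosen) case n = 3, k = 1, that colouring each n-set by its minimal element, capped at k + 2, never gives two disjoint sets the same colour:

    from itertools import combinations

    n, k = 3, 1
    N = 2 * n + k
    vertices = [frozenset(c) for c in combinations(range(1, N + 1), n)]

    def colour(A):
        return min(min(A), k + 2)   # Kneser's (k+2)-colouring

    # proper: adjacent vertices (i.e. disjoint sets) never share a colour
    print(all(colour(A) != colour(B)
              for A, B in combinations(vertices, 2) if not A & B))   # True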
Lovász’s proof is regarded as something of a milestone because it introduced the idea
of using topological methods to solve combinatorial problems, which at the time was a
big surprise. With the benefit of hindsight one can reduce this surprise by introducing
topological methods in a simpler context that bears some resemblance to the problem we
wish to solve. The main tool he used was the Borsuk-Ulam theorem, so let us begin by
looking at that.

5.1 The Borsuk-Ulam theorem and some variants

The Borsuk-Ulam theorem is the following statement.

Theorem 5.1. Let $f : S^n \to \mathbb{R}^n$ be a continuous function. Then there exists $x \in S^n$ such that f(x) = f(−x).

It is equivalent to the following result, which is itself sometimes referred to as the Borsuk-Ulam theorem.

Theorem 5.2. Let $A_1, \dots, A_{n+1}$ be closed subsets of $S^n$ with union equal to $S^n$. Then there exist $x \in S^n$ and $i \in \{1, 2, \dots, n+1\}$ such that $x \in A_i$ and $-x \in A_i$.
Proof of equivalence. Assume Theorem 5.1, and let $A_1, \dots, A_{n+1}$ be closed subsets of $S^n$ with union $S^n$. Define a function $f : S^n \to \mathbb{R}^{n+1}$ by $f(x) = (d(x, A_1), \dots, d(x, A_{n+1}))$. Note that the image of f lies in the subset of $\mathbb{R}^{n+1}$ that consists of all vectors with non-negative coordinates such that at least one coordinate is zero, the second property following from the fact that the sets $A_i$ cover $S^n$. This set is homeomorphic to $\mathbb{R}^n$, as can be seen by taking the orthogonal projection from it to the subspace $\{y : \sum_i y_i = 0\}$.
By Theorem 5.1, there therefore exists x such that f(x) = f(−x). But since the union of the $A_i$ is $S^n$, there exists i such that $x \in A_i$, and therefore $d(x, A_i) = 0$, and therefore $d(-x, A_i) = 0$. Since $A_i$ is closed, this implies that $-x \in A_i$.
Conversely, suppose we can find a continuous function $f : S^n \to \mathbb{R}^n$ for which f(x) is never equal to f(−x). Define a continuous function $g : S^n \to S^{n-1}$ by
\[ g(x) = \frac{f(x) - f(-x)}{\|f(x) - f(-x)\|_2} \]
and note that g(−x) = −g(x) for every x. Now there exist n + 1 closed sets $B_1, \dots, B_{n+1}$ that cover $S^{n-1}$ without any of them containing a pair of antipodal points. For example, take a regular n-dimensional simplex centred at the origin and project its faces on to the sphere in the obvious way. Since each face is part of an affine hyperplane that does not contain 0, its projection lies strictly inside a hemisphere and therefore does not contain a pair of antipodal points. But then the n + 1 sets $g^{-1}(B_i)$ are closed and cover $S^n$, and none of them contains a pair of antipodal points (because g(−x) = −g(x)), contradicting Theorem 5.2.
This in turn implies (and as we shall soon see, is equivalent to) the same statement but
about open sets.
Theorem 5.3 (Open-sets version). Let $A_1, \dots, A_{n+1}$ be open subsets of $S^n$ with union equal to $S^n$. Then there exist $x \in S^n$ and $i \in \{1, 2, \dots, n+1\}$ such that $x \in A_i$ and $-x \in A_i$.

Theorem 5.2 implies Theorem 5.3. For each i let $f_i(x) = d(x, A_i^c)$ and let $f(x) = \max_i f_i(x)$. Then each $f_i$ is continuous, and therefore so is f. Also, since every x belongs to some $A_i$ and every $A_i$ is open, f(x) > 0 for every x. By compactness, it follows that there is some $\epsilon > 0$ such that $f(x) \ge \epsilon$ for every x.
Now for each i define $B_i$ to be $\{x : d(x, A_i^c) \ge \epsilon\}$. Then each $B_i$ is a closed subset of $A_i$ and the union of the $B_i$ is $S^n$. Therefore, by Theorem 5.2 we can find i and x such that $x \in B_i$ and $-x \in B_i$, which proves the result.
We now show that the open-sets version implies the following “mixed” version.

Theorem 5.4 (Open or closed version). Let $A_1, \dots, A_{n+1}$ be subsets of $S^n$, each of which is open or closed, with union equal to $S^n$. Then there exist $x \in S^n$ and $i \in \{1, 2, \dots, n+1\}$ such that $x \in A_i$ and $-x \in A_i$.

Theorem 5.3 implies Theorem 5.4. We prove this by induction on the number of closed sets. Suppose that $A_1, \dots, A_t$ are closed and $A_{t+1}, \dots, A_{n+1}$ are open. If t = 0 then the result is Theorem 5.3, so we are done by hypothesis. Otherwise, suppose that $A_t$ does not contain a pair of antipodal points. Define a function f by $f(x) = \max\{d(x, A_t), d(-x, A_t)\}$. Then f is continuous and never takes the value zero, so is bounded below by some $\epsilon > 0$. Write $(A_t)_\epsilon$ for the set $\{x \in S^n : d(x, A_t) < \epsilon\}$. Then $(A_t)_\epsilon$ is open, contains $A_t$ and does not contain a pair of antipodal points. But then by our inductive hypothesis, applied with $A_t$ replaced by $(A_t)_\epsilon$, at least one of the sets $A_1, \dots, A_{t-1}, A_{t+1}, \dots, A_{n+1}$ contains a pair of antipodal points and we are done.
Note that this last result includes the case where all the sets are closed, so all the results
we have proved so far are equivalent. It may look as though the last theorem is of interest
principally as an inductive step useful for getting from the all-open version to the all-closed
version, but actually we shall be using one of the intermediate cases.
To get a more intuitive feel for why the mixed result is true, it is helpful to consider the case n = 1 and assume that the two sets $A_1$ and $A_2$ are both intervals, or rather circular arcs. If they could be half-open intervals, then it would be easy to ensure that neither contained a pair of antipodal points – for example, if we identify the unit circle with the interval [0, 2π) in the usual way, then $A_1$ could be the interval [0, π) and $A_2$ the interval [π, 2π). But if one of the intervals is required to contain both end points, then we can no longer do this, or anything like it.

5.2 What has the Borsuk-Ulam theorem got to do with combinatorics?
You may possibly have encountered the beautiful problem of proving that there exists a triangle-free graph with chromatic number at least 2020 (or whatever the year happens to be). There are several different approaches to this problem, of which one uses so-called sphere graphs. These are graphs where the vertex set is the unit sphere in $\mathbb{R}^d$ for some d, and two vertices are joined if and only if their inner product satisfies some given inequality (or equivalently the angle between them satisfies some given inequality).
One simple proof of the result uses the open-sets version of the Borsuk-Ulam theorem. Let us state it slightly differently to bring out its relevance to colouring: if we colour the sphere $S^d$ with d + 1 open colours (allowing points to have more than one colour), then there will necessarily be a pair of antipodal points that have the same colour (or more precisely have a colour in common).
By slightly modifying this statement, we can remove any topological requirements on
the colours.

Theorem 5.5. Let δ > 0 and define a graph on $S^d$ by joining vectors u and v if $\langle u, v \rangle < -1 + \delta$. Then this graph has chromatic number at least d + 2.

Proof. If u and v are unit vectors, then $\|u+v\|^2 = 2 + 2\langle u, v \rangle$. It follows that $\langle u, v \rangle < -1 + \delta$ if and only if $\|u+v\|^2 < 2\delta$ (which is saying that the distance from u to −v is less than $\sqrt{2\delta}$).
Now let $C_1, \dots, C_{d+1}$ be sets that partition $S^d$, let $\epsilon = \sqrt{\delta/2}$, and for each i define $U_i$ to be the set of all points x such that $d(x, C_i) < \epsilon$. Then $U_i$ is open for each i, so by Theorem 5.3 there exists i such that $U_i$ contains a pair x, −x of antipodal points. By the definition of $U_i$, $C_i$ contains points u, v with $\|u - x\| < \epsilon$ and $\|v + x\| < \epsilon$, which implies by the triangle inequality that $\|u + v\| < 2\epsilon = \sqrt{2\delta}$, which as we saw above implies that u and v are joined by an edge. Therefore, there is no proper colouring with fewer than d + 2 colours.

Remark 5.6. If δ is small enough, then the bound above is sharp. We have basically seen a construction that demonstrates this in the previous subsection: we can take d + 2 unit vectors $u_1, \dots, u_{d+2}$ in $\mathbb{R}^{d+1}$ that form the vertices of a regular simplex, and let $C_i$ be the set of x such that $\langle x, u_i \rangle > \theta$. Provided θ is at most about $d^{-1}$, these sets cover all of $S^d$, and the largest distance between two vectors in $C_i$ is slightly less than 2. I'll leave it as an exercise to work out the details more precisely.

Remark 5.7. If δ is small, then two points that are joined to each other must be almost
antipodal. This clearly implies that any odd cycle in the graph must be quite long. (I’ll
leave it as another exercise to prove a precise statement along these lines.) In particu-
lar, this construction gives us a big supply of graphs with no short odd cycles and high
chromatic number.

5.3 Proof of Kneser’s conjecture


Now that you have seen that argument, it should come as less of a surprise that the Borsuk-
Ulam theorem will play a central role in the solution of Kneser’s conjecture. Indeed, we
can think of the Kneser graph as rather like a discrete version of the sphere graph just
discussed, since the condition that two sets are joined by an edge is quite similar to the
condition that two points should be almost antipodal. However, because of the discreteness
of the Kneser graph, is it is not completely obvious how to turn that thought into a rigorous
proof. Lovász was the first to do it, and soon afterwards Imre Bárány surprised everybody
by finding a much shorter proof (still based on the Borsuk-Ulam theorem). It was thought
for a long time that Bárány’s proof was the “right” argument, so it came as yet another
surprise when almost a quarter of a century later Joshua Greene, who was a student at

23
the time, came up with the following even shorter proof. He began by observing Theorem
5.4, the “mixed” version of the Borsuk-Ulam theorem we proved in the previous section.
And then came the crazily short proof of Lovász’s theorem.

Theorem 5.8. The chromatic number of the Kneser graph Gn,k is k + 2.

Proof. We are free to choose any set of size 2n + k as the ground set, so let us choose 2n + k
points in general position in the sphere S k , which we regard as a subset of Rk+1 , where
the key property we require is that no k + 2 points are contained in any proper subspace
of Rk+1 . Let E be the set of points chosen.
Let C1 , . . . , Ck+1 be sets that partition E (n) , and let φ : [2n + k] → S k be a map whose
image is in general position. In particular, the property we need is that no k + 2 points of
the image should lie in a proper subspace of Rk .
For each i, let Xi ⊂ S^k be the set of all x such that the open hemisphere H(x) = {y : ⟨x, y⟩ > 0} contains a set A that belongs to Ci. Also, let X0 = S^k \ (X1 ∪ · · · ∪ X_{k+1}). The sets Xi with 1 ≤ i ≤ k + 1 are open, so X0 is closed. Therefore, by Theorem 5.4 some Xi contains a pair of antipodal points.
If this happens for i ≥ 1, then since H(x) and H(−x) are disjoint, we find that Ci contains two disjoint sets and we are done. If it happens for i = 0, then the hemispheres H(x) and H(−x) do not contain any sets of size n, so each of them contains at most n − 1 points of E. There must therefore be at least (2n + k) − 2(n − 1) = k + 2 points in the subspace {y : ⟨x, y⟩ = 0}, which contradicts the fact that the points are in general position.
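For the smallest non-trivial case one can even check the theorem by brute force: when n = 2 and k = 1 the Kneser graph is the Petersen graph, and the theorem says that its chromatic number is 3. Here is a quick Python sketch that confirms this (the helper function is my own illustration, and the search is exponential, so it is suitable only for tiny cases).

    from itertools import combinations, product

    def kneser_chromatic_number(n, k):
        # Vertices: n-subsets of a ground set of size 2n + k.
        # Edges: pairs of disjoint subsets. Brute-force search over colourings.
        vertices = list(combinations(range(2 * n + k), n))
        edges = [(i, j) for i, j in combinations(range(len(vertices)), 2)
                 if not set(vertices[i]) & set(vertices[j])]
        for c in range(1, len(vertices) + 1):
            for colouring in product(range(c), repeat=len(vertices)):
                if all(colouring[i] != colouring[j] for i, j in edges):
                    return c

    print(kneser_chromatic_number(2, 1))  # 3 = k + 2 (the Petersen graph)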

6 Marcus and Tardos's solution to the Stanley-Wilf conjecture
Let π ∈ Sk be a permutation, let n ≥ k, and let σ ∈ Sn be another permutation. Given any
k numbers x1 < · · · < xk ∈ [n], we can obtain a new sequence σ(x1 ), . . . , σ(xk ) and let the
ordering of σ(x1 ), . . . , σ(xk ) define a permutation τ of [k] in a natural way: for each r ≤ k
we define τ (r) to be s if σ(xr ) is the sth smallest element of the set {σ(x1 ), . . . , σ(xk )}. If
we can find x1 , . . . , xk such that τ = π, then we say that σ contains π.
An equivalent and more economical way of defining containment is to say that σ contains
π if there exist x1 < · · · < xk such that σ(xi ) < σ(xj ) if and only if π(i) < π(j).
For example, let π be the permutation 2413 in S4 . (Here the notation simply means
that 1 maps to 2, 2 maps to 4, 3 maps to 1 and 4 maps to 3: it is not supposed to
represent the 4-cycle (2413).) Then the permutation 251983467 contains π, as we see from
the subsequences 2513, 2514, 5836, 5837, 5936, and 5937, but not, for example, from the
subsequence 2536.
If σ is a permutation that does not contain π, we say that σ is π-avoiding. An example of a permutation that avoids 2413 is 152637489. More generally, suppose that σ ∈ Sn, 0 < r < n, and there is a partition of [n] into a set A of size r and a set B of size n − r such that σ maps A to {1, 2, . . . , r} in the unique order-preserving way and maps B to {r + 1, . . . , n} in the unique order-preserving way. Then σ cannot contain 2413, since in order to find x2 < x3 such that σ(x2) > σ(x3) it is necessary to take x2 ∈ B and x3 ∈ A. But then x1 cannot be in A, since we would have σ(x1) < σ(x3), and x4 cannot be in B for similar reasons. But that implies that σ(x1) > σ(x4), whereas the pattern 2413 requires σ(x1) < σ(x4).
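The definition of containment translates directly into a brute-force test. The following Python sketch (my own illustration, and exponential in k) checks the two examples above.

    from itertools import combinations

    def contains(sigma, pi):
        # Does sigma contain the pattern pi? Both in one-line notation.
        k = len(pi)
        for positions in combinations(range(len(sigma)), k):
            values = [sigma[x] for x in positions]
            if all((values[i] < values[j]) == (pi[i] < pi[j])
                   for i, j in combinations(range(k), 2)):
                return True
        return False

    print(contains([2, 5, 1, 9, 8, 3, 4, 6, 7], [2, 4, 1, 3]))  # True
    print(contains([1, 5, 2, 6, 3, 7, 4, 8, 9], [2, 4, 1, 3]))  # False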
The number of permutations we have just constructed is about 2^n, since the set A determines the permutation uniquely. However, these are not the only 2413-avoiding permutations: they were enumerated exactly in a paper of Miklós Bóna in 1997. Nevertheless, it turns out that there are only exponentially many. Since n! is around (n/e)^n, which is superexponential, this shows that avoiding the permutation 2413 is a very strong condition.
The Stanley-Wilf conjecture was that for every permutation π, there exists a constant C = C(π) such that the number of permutations in Sn that avoid π is at most C^n for every n. It was posed in the late 1980s, received a lot of attention, and was finally solved by Adam Marcus and Gábor Tardos in 2004 with a remarkably simple argument.
Actually, Marcus and Tardos solved a different problem that was known to imply the Stanley-Wilf conjecture, so let us first discuss that problem and its solution and then see how the implication works.
The problem solved by Marcus and Tardos concerns a closely related notion of containment for 01-matrices. We say that an n × n matrix A contains a k × k matrix B if we can find x1 < · · · < xk and y1 < · · · < yk such that A_{x_i y_j} = 1 whenever B_{ij} = 1.
Two important things to note about this definition are that we do not insist that Axi yj = 0
if Bij = 0, and that the order matters. The second point is particularly important, so let
us look at it a different way. We can regard A and B as adjacency matrices of bipartite
graphs G(A) and G(B) (so for example G(A) has two vertex sets X, Y that are copies of
[n], with i ∈ X joined to j ∈ Y if and only if Aij = 1). Then the condition that A contains
B is saying more than that G(A) contains G(B) as a subgraph: it is saying that there is an
embedding of G(B) into G(A) in such a way that the orderings of the two sets of vertices
are preserved.  
As an example, the matrix

    1 0 1 0 0 0
    0 1 0 0 0 1
    1 0 0 0 0 0
    0 0 0 0 1 1
    0 0 0 1 1 0

contains the matrix

    1 0 1
    0 0 1
    0 1 0

as is shown by the entries in rows 2, 4 and 5 and columns 2, 5 and 6.
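Matrix containment is just as easy to test by brute force. Here is an illustrative Python sketch (my own, exponential in the size of B) that confirms the example.

    from itertools import combinations

    def matrix_contains(A, B):
        # Ordered containment: rows x1 < ... < xk and columns y1 < ... < yk
        # of A with a 1 wherever B has a 1 (0s of B impose no condition).
        k, l = len(B), len(B[0])
        for rows in combinations(range(len(A)), k):
            for cols in combinations(range(len(A[0])), l):
                if all(A[rows[i]][cols[j]] == 1
                       for i in range(k) for j in range(l) if B[i][j] == 1):
                    return True
        return False

    A = [[1, 0, 1, 0, 0, 0],
         [0, 1, 0, 0, 0, 1],
         [1, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1, 0]]
    B = [[1, 0, 1],
         [0, 0, 1],
         [0, 1, 0]]
    print(matrix_contains(A, B))  # True: rows 2, 4, 5 and columns 2, 5, 6 work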
As with permutations, we say that a matrix A avoids a matrix B if it does not contain
B.
Of particular interest is the case where B is a permutation matrix. We can ask how many n × n matrices avoid B, but actually the question Marcus and Tardos looked at was how many non-zero entries such a matrix can have. It had been conjectured by Füredi and Hajnal that for any given B the number was at most linear in n. This was the conjecture that Marcus and Tardos proved.

Theorem 6.1 (Marcus, Tardos). Let P be a permutation matrix. Then there exists a constant C = C(P) such that every n × n 01-matrix that avoids P has at most Cn non-zero entries.

Before we move on to the proof, let us briefly look in a slightly different way at what it
means to contain a permutation matrix. Define an ordered partition of [n] to be a partition
into sets X1 , . . . , Xm such that if i < j, then every element of Xi is less than every element
of Xj .
A matrix A contains the permutation matrix P of a permutation in Sk if and only if there exist ordered partitions R1, . . . , Rk and C1, . . . , Ck of the rows and columns of A such that if P_{ij} = 1 then there is a non-zero entry in the submatrix of A given by the rows in Ri and the columns in Cj. To illustrate this with a diagram,
 
  1 1 0 0 1
1 0 0 0 0 0 0 1
 
0 0 1 is contained in 0 1 0 1 0
 
0 1 0 0 0 1 0 1
0 0 0 0 1

because if we block the rows as {1}, {2, 3}, {4, 5} and the columns as {1, 2}, {3, 4}, {5}, then each block that is required to contain a 1 does so. (Note that other blockings would have worked just as well.) Thus, we can think of matrix containment as saying that the small matrix is a “subset of a quotient” of the big matrix. We shall see that blockings like this play an important part in the proof of Marcus and Tardos.
Let us make this observation more precise. Let A be an n × n 01-matrix, let R1 , . . . , Rk
and C1 , . . . , Ck be as above, and let B be the k × k matrix obtained by setting Bij = 1 if
and only if there is a 1 in the block Ri × Cj – that is, a 1 whose row belongs to Ri and
whose column belongs to Cj . We shall call such a matrix B an ordered quotient of A.

Lemma 6.2. Let P be a permutation matrix and let A be a 01-matrix that avoids P. Then every ordered quotient of A avoids P.

Proof. Let σ ∈ Sk, let P be the corresponding permutation matrix (that is, P_{ij} = 1 if j = σ(i) and 0 otherwise), and suppose that A has an ordered quotient B that contains P. Let R1, . . . , Rt and C1, . . . , Ct be the ordered partitions that determine the quotient. Let x1 < · · · < xk and y1 < · · · < yk be such that B_{x_i y_{σ(i)}} = 1 for i = 1, 2, . . . , k, and for each i pick an entry A_{r_i s_i} = 1 such that r_i ∈ R_{x_i} and s_i ∈ C_{y_{σ(i)}}. Since x1 < · · · < xk and the row partition is ordered, r1 < · · · < rk; and since y1 < · · · < yk and the column partition is ordered, s_i < s_j whenever σ(i) < σ(j). So if we list the s_i in increasing order as t_1 < · · · < t_k, then t_{σ(i)} = s_i, and A_{r_i t_{σ(i)}} = 1 for every i. Therefore, A contains P.

Remark 6.3. If that proof did not seem trivial to you, then I recommend drawing a
diagram, or looking at the diagram of the block matrix given above.

Now let us start the proof in earnest. Let us fix a k × k permutation matrix P and for each positive integer n let f(n) be the largest possible number of non-zero entries in an n × n 01-matrix that does not contain P. For convenience, let us assume for the time being that n is divisible by k^2. This allows us to take ordered partitions R1, . . . , R_{n/k^2} and C1, . . . , C_{n/k^2} of the rows and columns, with each Ri and Cj of size k^2.

Given an n × n 01-matrix A with k^2 | n, let us write A_{ij} for the submatrix obtained by restricting to the rows in Ri and columns in Cj. Call these k^2 × k^2 submatrices blocks.
The key to the proof is the following definition. We shall call a block wide if at least k different columns of the block contain a 1, and tall if at least k different rows contain a 1. The structure of the proof will be to show that if A does not contain P, then there cannot be too many wide blocks, or too many tall blocks, or too many non-empty blocks that are neither wide nor tall.

Lemma 6.4. If A does not contain a given k × k permutation matrix P, then for each j the number of wide blocks with columns in Cj is at most (k − 1)\binom{k^2}{k}, and for each i the number of tall blocks with rows in Ri is at most (k − 1)\binom{k^2}{k}.

Proof. If there are more than (k − 1)\binom{k^2}{k} wide blocks with columns in Cj, then by the pigeonhole principle we must be able to find a set K of k columns in Cj and a set B of k blocks such that every block in B contains a 1 in every column in K. But then the submatrix of A that is formed out of the blocks in B (placed one on top of another) has a k × k ordered quotient with 1s everywhere: we split the rows according to which block of B they belong to, and we split the columns in such a way that each of the chosen k columns belongs to a different cell of the partition. Therefore, A contains the k × k all-1s matrix, which trivially implies that it contains P.
The same argument, interchanging the roles of rows and columns, proves the statement for tall blocks.
Lemma 6.4 allows us to prove a recursion for the number of entries that A can have.

Corollary 6.5. Let P be a k × k permutation matrix, let n be divisible by k^2, and let f(n) be the largest number of non-zero entries of any n × n 01-matrix that does not contain P. Then

    f(n) ≤ 2k^2(k − 1)\binom{k^2}{k} n + (k − 1)^2 f(n/k^2).

Proof. The number of wide blocks is at most (n/k^2)·(k − 1)\binom{k^2}{k}, by Lemma 6.4, and each such block contains at most k^4 1s. The same is true of tall blocks. So the number of 1s contained in wide or tall blocks is at most the first term in the expression above.
Let B be the ordered quotient of A defined by R1, . . . , R_{n/k^2} and C1, . . . , C_{n/k^2}. Then by Lemma 6.2, B does not contain P. It follows that B has at most f(n/k^2) non-zero entries. So the number of non-zero blocks of A is at most f(n/k^2). We have already counted the entries that belong to wide or tall blocks. Any other block has at most (k − 1)^2 non-zero entries, so the number of entries not so far counted is at most (k − 1)^2 f(n/k^2). This proves the result.
The interesting part of the proof is over: it remains to obtain an upper bound for f (n)
from this recursion.

Proof of Theorem 6.1. If P is a k × k permutation matrix, we shall prove by induction on n that f(n) ≤ 2k^4\binom{k^2}{k} n.
If n ≤ k^2, then trivially f(n) ≤ n^2 ≤ k^4 ≤ k^4 n, so we are done.
Otherwise, let m be the largest multiple of k^2 that is less than n. Then, again trivially, f(n) ≤ f(m) + 2k^2 n. And by our inductive hypothesis and Corollary 6.5,

    f(m) ≤ 2k^2(k − 1)\binom{k^2}{k} m + (k − 1)^2 · 2k^4\binom{k^2}{k} (m/k^2).

But

    2k^2(k − 1) + (k − 1)^2 · 2k^2 = 2k^3(k − 1),

so putting this together we find that

    f(n) ≤ 2(k^4 − k^3)\binom{k^2}{k} m + 2k^2 n ≤ 2k^4\binom{k^2}{k} n.
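One can also watch the linear bound emerge by iterating the recursion of Corollary 6.5 numerically. The following sketch (my own illustration, using the trivial bound f(n) ≤ n^2 as the base case) shows the ratio of the resulting bound to n settling down well below 2k^4\binom{k^2}{k}.

    from math import comb

    def recursion_bound(n, k):
        # Iterate f(n) <= 2k^2(k-1)C(k^2,k)n + (k-1)^2 f(n/k^2),
        # with the trivial bound n^2 once n <= k^2.
        if n <= k * k:
            return n * n
        return (2 * k**2 * (k - 1) * comb(k**2, k) * n
                + (k - 1)**2 * recursion_bound(n // k**2, k))

    k = 2
    for e in range(1, 7):
        n = (k * k) ** e
        print(n, recursion_bound(n, k) / n)  # the ratio stays bounded (here below 64.1)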

Remark 6.6. It is not immediately obvious how to find a matrix with more than (k − 1)^2 n non-zero entries that avoids a given k × k permutation matrix, and for a while it was conjectured that this should be the right bound. However, Jacob Fox, in a very nice paper, proved that in fact the constant has to be exponential in k for almost all k × k permutation matrices. So the use of the pigeonhole principle above, which looks quite inefficient, is in fact not too wasteful.

Let us now turn to the Stanley-Wilf conjecture.

Corollary 6.7. For every permutation π ∈ Sk there is a constant C such that for every n there are at most C^n permutations σ ∈ Sn that avoid π.

Proof. Let P be the matrix of π and let α be such that every n × n 01-matrix with at least αn non-zero entries contains P; such an α exists by Theorem 6.1. For each n let T(n) be the number of n × n 01-matrices that avoid P. We claim that T(2n) ≤ 15^{αn} T(n).
To see this, observe first that if we take ordered partitions into sets of size 2 of the rows and columns of a 2n × 2n matrix that avoids P, then the quotient is an n × n matrix that avoids P, by Lemma 6.2, so the number of distinct quotient matrices we can obtain is at most T(n). But each such quotient matrix has at most αn non-zero entries, so the number of matrices with quotient equal to any given n × n matrix is at most 15^{αn} (since there are four entries to choose in each non-empty block and they cannot all be zero). The recursion follows.
It follows immediately by induction that if n is a power of 2, then T(n) ≤ 15^{αn}. Now a permutation σ avoids π if and only if its permutation matrix avoids P, so the number of π-avoiding permutations in Sn is at most T(n). It is also easy to see that this number is a non-decreasing function of n, so for general n we may compare with the next power of 2 (which is less than 2n) and deduce a bound of 15^{2αn}.

7 Entropy arguments
This section is slightly different from previous ones in that we shall need to develop a little
bit of theory, and only after that will the arguments be very short. However, the theory
consists of a few basic statements, and what really matters is not the proofs of those
statements, but how to use them. To put it another way, I recommend treating those
statements more like axioms than lemmas. To encourage this, I shall do so myself, but
just to give you something to hold on to, which makes some of the axioms more intuitive,
you should think of the entropy H[X] of a discrete random variable X as a real number
that measures the “information content” of X. Roughly speaking, this is how many bits
of information you gain, on average, if you find out the value of X, or equivalently, the
expected number of bits needed to specify X.

7.1 The Khinchin axioms for entropy and some simple consequences
Entropy has the following properties, which are called the Khinchin (or Shannon-Khinchin)
axioms. (I got some of these from a set of lecture notes by Cosma Shalizi, which I recom-
mend for further discussion of the pros and cons of an axiomatic approach.)

0. Normalization. If X takes the values 0 and 1, each with probability 1/2, then
H[X] = 1.

1. Invariance. H[X] depends only on the probability distribution of X. That is, if Y =


f (X) for a function f that is bijective on the values taken by X, then H[Y ] = H[X].

2. Maximality. If X takes at most k distinct values, then H[X] is maximized when X


takes each value with equal probability 1/k.

3. Extensibility. If X is a random variable that takes values in a finite set A and Y is a


random variable that takes values in a set B with A ⊂ B, and if P[X = a] = P[Y = a]
for every a ∈ A (and hence P[Y = b] = 0 for every b ∈ B \ A), then H[Y ] = H[X].

4. Additivity. For any two random variables X and Y, H[X, Y] = H[X] + H[Y|X], where

    H[Y|X] = Σ_x P[X = x] H[Y|X = x].

5. Continuity. H[X] depends continuously on the probabilities P[X = x].

Axiom 0 is not really one of Khinchin’s axioms, but the remaining axioms determine H
only up to a multiplicative constant so it is there to fix that constant to a convenient value.
Axioms 1-3 and 5 are rather basic properties of a kind that one might expect, but axiom
4 needs more comment. The quantity H[X, Y ] is simply the entropy (whatever that will
turn out to mean) of the joint random variable (X, Y). The quantity H[Y|X] is called the conditional entropy of Y given X: it is the average entropy of Y given the value of X.
Note that from its definition and the fact that H takes non-negative values (which implies
that H[Y |X = x] is non-negative for each x), it follows that H[Y |X] is non-negative.
We now prove a sequence of lemmas, most of them very simple.

Lemma 7.1. If X and Y are independent, then H[Y |X] = H[Y ] and H[X, Y ] = H[X] +
H[Y ].

Proof. For each x the distribution of Y given that X = x is the same as the distribution of
Y , so H[Y |X = x] = H[Y ] for every x, by the invariance axiom. (The reason that axiom
is needed is that strictly speaking the random variable Y |X = x takes values of the form
(x, y), where y is a value taken by Y .) It follows that
    H[Y|X] = Σ_x P[X = x] H[Y|X = x] = Σ_x P[X = x] H[Y] = H[Y].

The second statement then follows from the additivity axiom.

Lemma 7.2. If X takes just one value, then H[X] = 0.

Proof. H[X, X] = H[X], by the invariance axiom. But X is independent of itself, since it takes only one value, so H[X, X] = 2H[X], by Lemma 7.1. It follows that H[X] = 0.
And another.

Lemma 7.3. Let A ⊂ B, let X be uniformly distributed on A, and let Y be uniformly


distributed on B. Then H[X] ≤ H[Y ], with equality if and only if A = B.

Proof. By extensibility, H[X] is not affected if we regard X as taking values in B. The inequality then follows from the maximality axiom.
Suppose now that |A| = r, |B| = s, and r < s. If r = 1, then the result follows from Lemma 7.2, the normalization axiom, and what we have just proved.
Otherwise, denote by X^n the A^n-valued random variable given by n independent copies of X, and similarly for Y. Then for any n, Lemma 7.1 and induction imply that H[X^n] = nH[X] and H[Y^n] = nH[Y].
Now choose n such that r^n ≤ s^{n−1}. Then |A^n| ≤ |B^{n−1}|, so

    nH[X] = H[X^n] ≤ H[Y^{n−1}] = (n − 1)H[Y],

where the inequality follows from what we have just proved (together with the invariance axiom). Since H[X] ≥ 1 (again by what we proved above), it follows that H[X] < H[Y].

And another.

Lemma 7.4. Let X be a random variable and let Y = f (X) for some function f . Then
H[Y ] ≤ H[X].

Proof. By the invariance axiom, H[X] = H[X, Y], since there is a bijection between values x taken by X and values (x, f(x)) taken by (X, Y). Therefore, by the additivity axiom, H[X] = H[Y] + H[X|Y] ≥ H[Y], where the last step uses the non-negativity of conditional entropy, which follows from Lemma 7.5 below (whose proof does not depend on this lemma).
In an earlier version of these notes, I assumed that H was non-negative, having failed
to see a proof of non-negativity from the axioms. However, Sean Eberhard (a postdoc at
Cambridge) pointed out to me the following argument.
Lemma 7.5. H[X] ≥ 0 for every discrete random variable X that takes values in a finite
set A.
Proof. First let us suppose that there exists n such that pa = P[X = a] is a multiple of n^{-1} for every a ∈ A. Let Y be uniformly distributed on [n], let (Ea : a ∈ A) be a partition of [n] such that |Ea| = pa·n for each a ∈ A, and let Z be the random variable where Z = a if Y ∈ Ea. Then Z and X are identically distributed, so H[Z] = H[X], by invariance.
Now H[Y, Z] = H[Z] + H[Y|Z], by additivity. Also, H[Y, Z] = H[Y], by the invariance axiom, since Z is determined by Y. And finally, for each a ∈ A, the random variable [Y|Z = a] is uniformly distributed on a set of size at most n, so by Lemma 7.3 it follows that H[Y|Z = a] ≤ H[Y]. This implies that H[Y|Z] ≤ H[Y] = H[Y, Z], and therefore that H[X] = H[Z] ≥ 0.
In the general case, since we can approximate the probabilities pa arbitrarily closely by multiples of n^{-1} for a suitably large n, we can apply the continuity axiom to obtain the same conclusion.
Here is a slightly less simple lemma.
Lemma 7.6. Let X be a random variable that takes at least two values with non-zero
probability. Then H[X] > 0.
Proof. Let A be the set of values taken by X, let α = max_{a∈A} P[X = a], and for each n denote by X^n the A^n-valued random variable that is given by n independent copies of X. Then the maximum probability of any value taken by X^n is α^n. Since α < 1, for any ε > 0 there exists n such that α^n < ε. It follows that we can partition A^n into two sets E and F, each of which has probability between 1/2 − ε and 1/2 + ε. Now let Y be a random variable that takes the value 0 if X^n ∈ E and 1 if X^n ∈ F. Then H[X^n] = nH[X], by Lemma 7.1 (and induction), and also H[X^n] = H[X^n, Y] = H[Y] + H[X^n|Y] ≥ H[Y], by invariance (since Y is determined by X^n) and additivity. But H[Y] > 0 for sufficiently small ε, by the normalization axiom and continuity. It follows that H[X^n] > 0 and therefore that H[X] > 0.
We end this sequence of lemmas with a result that is often useful. It is sometimes
known as the chain rule for entropy.
Lemma 7.7. Let X1 , . . . , Xk be random variables taking values in a set A. Then
H[X1 , . . . , Xk ] = H[X1 ] + H[X2 |X1 ] + H[X3 |X1 , X2 ] + · · · + H[Xk |X1 , . . . , Xk−1 ].
Proof. By additivity,
H[X1 , . . . , Xk ] = H[X1 , . . . , Xk−1 ] + H[Xk |X1 , . . . , Xk−1 ].
The result therefore follows by induction, with the additivity axiom as the base case.

7.2 The number of paths of length 3 in a bipartite graph
Let us now consider the following problem. Suppose that G is a bipartite graph with
finite vertex sets A and B and density α. (The density is defined to be the number of
edges divided by |A||B|.) A labelled P3 is a quadruple (x1, y1, x2, y2) such that x1, x2 ∈ A, y1, y2 ∈ B, and x1y1, y1x2, and x2y2 are all edges of G. In other words, it is a path of length 3 in the graph, but we allow degeneracies such as x1 = x2.
How many labelled P3s must a bipartite graph with density α contain? If G is bi-regular, meaning that every vertex in A has degree α|B| and every vertex in B has degree α|A|, then there are α^3|A|^2|B|^2, since we can choose x1 in |A| ways, then y1 in α|B| ways, then x2 in α|A| ways, and finally y2 in α|B| ways. We shall now show that this is the smallest number of labelled P3s that there can be. The proof will assume that entropy
exists – that is, that there is some H that satisfies the Khinchin axioms. Later we shall
see that this assumption is valid, and that will put in the final piece of the jigsaw.
Before you read the proof, I strongly recommend you try to prove the result for yourself
by elementary means, since it looks as though it ought to be possible (given that the result
is true), but turns out to be surprisingly tricky, and if you haven’t experienced the difficulty,
then you won’t appreciate the power of the entropy approach.
How does one use entropy to prove results in combinatorics? Part of the answer lies in axiom 2. Suppose that X is uniformly distributed on a set of size k, and Y is a random variable with H[Y] ≥ H[X]. Then axiom 2 implies that Y takes at least k different values. Therefore, if we want to prove that a set A has size at least k, one way of doing it is to find a random variable that takes values in A and has entropy at least H[X].
At first this may seem a very strange approach, since by axiom 2 we know that if there
is such a random variable, then a random variable that is uniformly distributed on A will
also work. If we define f (n) to be the entropy of a random variable that is uniformly
distributed on a set of size n, then all we seem to be doing is replacing the cardinality of
a set by f of that cardinality, which doesn’t look as though it will achieve anything.
However, there is a flaw in that criticism, which is that it might in principle be easier
to obtain a lower bound for the entropy of a carefully chosen distribution on a set A (given
certain assumptions about A) than it is to find a lower bound on the cardinality of A. And
indeed, this turns out to be the case in many interesting situations, including the one at
hand.
We wish to obtain a lower bound for the number of labelled P3s in a bipartite graph G of density α, and to do so we shall obtain a lower bound for the entropy of the following distribution on the set of labelled P3s, which is not uniform (except when the graph is bi-regular). We choose an edge x1y1 (with x1 ∈ A and y1 ∈ B) uniformly at random, then a vertex x2 uniformly from the neighbours of y1, and then a vertex y2 uniformly from the neighbours of x2.
Let X1 , Y1 , X2 , and Y2 be the distributions of x1 , y1 , x2 , and y2 , respectively. We now
wish to say something about the entropy H[X1 , Y1 , X2 , Y2 ]. The chain rule (Lemma 7.7)
tells us that it is equal to
H[X1 , Y1 ] + H[X2 |X1 , Y1 ] + H[Y2 |X1 , Y1 , X2 ].

Now

    H[X2|X1, Y1] = Σ_{a∈A} Σ_{b∈B} P[X1 = a, Y1 = b] H[X2|X1 = a, Y1 = b].

But for each fixed b, the distributions of X1 and X2 given that Y1 = b are independent: the way we choose x2 once we have chosen y1 depends entirely on y1 and not on how y1 was obtained. Therefore, this simplifies to

    Σ_{a∈A} Σ_{b∈B} P[X1 = a, Y1 = b] H[X2|Y1 = b] = Σ_{b∈B} P[Y1 = b] H[X2|Y1 = b] = H[X2|Y1].

In a similar way, if we know the value of X2 , then the distribution of Y2 is independent of


the values of X1 and Y1 , so

H[Y2 |X1 , Y1 , X2 ] = H[Y2 |X2 ].

So we are interested in finding a lower bound for

H[X1 , Y1 ] + H[X2 |Y1 ] + H[Y2 |X2 ].

By the additivity axiom, we can write this as

H[X1 , Y1 ] + H[Y1 , X2 ] + H[X2 , Y2 ] − H[Y1 ] − H[X2 ].

Notice that the first three terms are the entropies of the distributions of the three edges of the random labelled P3. We now make an important observation.

Lemma 7.8. Given a random labelled P3 from the distribution defined above, the three edges are all uniformly distributed over all edges.

Proof. The first edge is uniformly distributed by the definition of the distribution. Now
the number of edges is α|A||B| and the number of edges x1 y1 with y1 = b is d(b) (the degree
of b), so the probability that Y1 = b is d(b)/α|A||B|, and the probability that X2 = a given
that Y1 = b is 0 if ab is not an edge and d(b)−1 if ab is an edge. So the probability that
(X2 , Y1 ) = (a, b) is 1/α|A||B| whenever ab is an edge, which is another way of saying that
X2 Y1 is uniformly distributed over all edges. And once we know that, then the same proof
shows that X2 Y2 is uniformly distributed.
We also know that H[Y1 ] and H[X2 ] are at most as big as they would be if Y1 and X2
were uniformly distributed. So if we let X be uniformly distributed over A, Y be uniformly
distributed over B, and E be uniformly distributed over all edges, then a lower bound for
the entropy of (X1 , Y1 , X2 , Y2 ) is

3H[E] − H[X] − H[Y ].

Consider now the random variable (X1 , Y1 , X2 , Y2 , X, Y ), where (X1 , Y1 , X2 , Y2 ) is as
before, X is a random element of A, and Y is a random element of B, with X and Y
independent of each other and of (X1 , Y1 , X2 , Y2 ). By the above bound and Lemma 7.1
this random variable has entropy at least 3H[E], which is the entropy of the uniform distribution over all triples of edges (by Lemma 7.1). From this and Lemma 7.3, it follows that |A||B| times the number of labelled P3s is at least the cube of the number of edges, which is α^3|A|^3|B|^3, and from this we get that the number of labelled P3s is at least α^3|A|^2|B|^2, as required.
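If you would like to see the inequality in action, here is a quick brute-force check in Python (my own illustrative sketch, with an arbitrarily chosen small graph).

    from itertools import product

    A, B = range(3), range(4)
    edges = {(0, 0), (0, 1), (0, 3), (1, 1), (1, 2), (2, 2), (2, 3)}
    alpha = len(edges) / (len(A) * len(B))

    # Count labelled P3s: quadruples (x1, y1, x2, y2) with
    # x1y1, y1x2 and x2y2 all edges (degeneracies allowed).
    count = sum(1 for x1, y1, x2, y2 in product(A, B, A, B)
                if (x1, y1) in edges and (x2, y1) in edges and (x2, y2) in edges)

    print(count, alpha ** 3 * len(A) ** 2 * len(B) ** 2)  # count >= the bound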

The statement just proved is a special case of the following famous conjecture of
Sidorenko.

Conjecture 7.9. Let G be a bipartite graph with finite vertex sets X and Y and density α.
Let H be another bipartite graph with vertex sets A and B and let φ be a random function
that takes A to X and B to Y . Then the probability that φ(a)φ(b) is an edge of G for every
edge ab of H is at least α^{|E(H)|}.

In rough terms, this can be thought of as saying that if you want to minimize the number
of copies of H in a bipartite graph of density α, then you cannot do better than to pick the
edges of the bipartite graph independently at random with probability α. The conjecture
has been proved for several classes of bipartite graphs, some by entropy methods, but in
general it remains stubbornly open.

7.3 The formula for entropy


I once wrote a blog post about the above proof, in which I did things differently. There
I defined entropy in a more usual way – by writing down a formula for it – and then I
calculated entropies, or gave bounds in the form of specific numbers. When I started this
section, I decided to try to do it axiomatically, but I wasn’t sure how successful I would be.
Having completed the exercise, I am now completely convinced that it is the right thing to
do, as it makes it much clearer that the proof is comparing the entropy of one distribution
with the entropy of another, rather than merely obtaining a numerical lower bound that
gives the desired answer. Also, the proof in the blog post used Jensen’s inequality, whereas
this proof used the simpler maximality axiom.
So when I now give the formula, I recommend that you resist the temptation to latch
on to it and use it the whole time. It should be a last resort – your proofs will be clearer if
you can avoid it. It's a little like helping a much younger mathematician to get out of the habit of replacing √2 by 1.414.... At a certain age, one feels the need to experience the square root of 2 as a number with a decimal expansion, but with experience one comes to realize that what really matters is the “axioms for √2”, which are

1. √2 > 0.

2. (√2)^2 = 2.


Of course, sometimes we use facts such as that √2 > 1, but those can be deduced from the above properties and the ordered-field axioms that the reals satisfy.
With those remarks out of the way, here’s the formula. If X is a discrete random
variable taking values in a set A, then, writing pa for P[X = a], we have
    H[X] = Σ_{a∈A} pa log(1/pa),

where the logarithm is to base 2. People often write this as −Σ_{a∈A} pa log(pa), a practice I don't like because it has a kind of clever-clever “You thought I was negative but I'm not!” aspect to it.
A quick example to help with orientation: if X is uniformly distributed over a set A of size n, then the formula tells us that H[X] = Σ_{a∈A} n^{-1} log n = log n. In particular, if n = 2^k, then the entropy is k, which reflects the intuitive idea that we need k bits of information to specify an element of A.
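In Python the formula is just a couple of lines, which can serve as a sanity check (a small illustrative sketch of mine).

    from math import log2

    def entropy(probabilities):
        # Shannon entropy, in bits, of a discrete distribution.
        return sum(p * log2(1 / p) for p in probabilities if p > 0)

    print(entropy([1 / 8] * 8))   # uniform on 8 values: log 8 = 3 bits
    print(entropy([0.9, 0.1]))    # a biased coin: about 0.469 bits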
It is not hard to prove that this function satisfies the axioms given earlier. Normalization, invariance, extensibility and continuity are obvious. Maximality is a simple consequence of Jensen's inequality: the function log x is concave, so if A is any finite set, then for any random variable X taking values in A, if we write pa for P[X = a], then the pa are non-negative and sum to 1, so we have

    Σ_{a∈A} pa log(1/pa) ≤ log(Σ_{a∈A} pa · (1/pa)) = log |A|,

which is the entropy of a uniformly distributed random variable taking values in A.


As for the additivity axiom, let X take values in A and Y take values in B and let pa, pb and pab have obvious meanings (in particular, pab = P[X = a, Y = b]). Then

    H[X, Y] = Σ_{a∈A} Σ_{b∈B} pab log(1/pab).

Now pab = pa P[Y = b|X = a], so the right-hand side equals

    Σ_{a∈A} Σ_{b∈B} pab (log(1/pa) + log(1/P[Y = b|X = a])).

But

    Σ_{a∈A} Σ_{b∈B} pab log(1/pa) = Σ_{a∈A} pa log(1/pa) = H[X],

and

    Σ_{a∈A} Σ_{b∈B} pab log(1/P[Y = b|X = a]) = Σ_{a∈A} pa Σ_{b∈B} P[Y = b|X = a] log(1/P[Y = b|X = a])
                                             = Σ_{a∈A} pa H[Y|X = a]
                                             = H[Y|X].

Thus, H[X, Y ] = H[X] + H[Y |X].
This proves that there is a function that satisfies the entropy axioms, and that completes
the proof of the lower bound for the number of labelled P 3s.

7.3.1 The axioms uniquely determine the formula.


It turns out that the formula we have given for entropy is the only one that satisfies
the entropy axioms. To show this, we begin by working out H[X] when X is uniformly
distributed. (Here H is any function that satisfies the axioms for entropy. Logarithms are
to base 2 throughout.)

Lemma 7.10. If X is uniformly distributed on a set of size 2^k, then H[X] = k.

Proof. Let Y be uniformly distributed on a set of size 2. Then H[Y] = 1 by the normalization axiom, which implies that H[Y^k] = k, by Lemma 7.1 and induction. Since Y^k is uniformly distributed on a set of size 2^k, H[X] = k as well, by invariance.

Lemma 7.11. If X is uniformly distributed on a set of size n, then H[X] = log n.

Proof. For each r, X^r is uniformly distributed on a set of size n^r, and H[X^r] = rH[X]. Therefore, if 2^k ≤ n^r ≤ 2^{k+1}, then k ≤ rH[X] ≤ k + 1, by Lemmas 7.3 and 7.10. It follows that k/r ≤ H[X] ≤ (k + 1)/r whenever k/r ≤ log n ≤ (k + 1)/r. This implies that H[X] = log n, as claimed.

Corollary 7.12. Let X take values in a finite set A, with pa = P[X = a]. Then

    H[X] = Σ_a pa log(1/pa).

Proof. First let us assume that there exists n such that pa is a multiple of n^{-1} for every a ∈ A. As in the proof of Lemma 7.5, let Y be uniformly distributed on [n], let [n] be partitioned into sets Ea, one for each a ∈ A, with |Ea| = pa·n, and let us assume that X = a if and only if Y ∈ Ea (which by the invariance axiom loses no generality).
Then by additivity (together with invariance, since X is determined by Y), H[Y] = H[X] + H[Y|X]. But by Lemma 7.11, H[Y] = log n, and

    H[Y|X] = Σ_a pa H[Y|X = a] = Σ_a pa H[Y|Y ∈ Ea] = Σ_a pa log(pa·n),

where for the last equality we again applied Lemma 7.11. It follows that

    H[X] = log n − Σ_a pa (log pa + log n) = Σ_a pa log(1/pa).

As in the proof of Lemma 7.5, we obtain the general case by applying the continuity axiom.

7.4 Brégman’s theorem
We shall now prove another result using entropy. For this result the final answer is a
number, so the proof is not quite as purely based on the entropy axioms. However, the
only numerical result needed is Lemma 7.11, the relatively simple result that the entropy
of the uniform distribution on a set of size n is log n.
We begin with a definition of an important combinatorial concept.

Definition 7.13. Let A be an n × n matrix. The permanent of A is the sum

    per(A) = Σ_{σ∈Sn} Π_{i=1}^n A_{iσ(i)},

where Sn is the symmetric group on {1, 2, . . . , n}.

The permanent of a matrix can be thought of as “the determinant if you accidentally


forget the signs”. Unlike the determinant, which has many equivalent definitions and nice
algebraic properties that allow one to calculate it efficiently, the permanent is extremely
hard to calculate (a theorem of Leslie Valiant states that the problem is #P hard, which
makes it at least as hard as any problem in NP, and possibly harder), so it may seem a
little surprising that it is ever studied. But one hint that it is a natural concept, even if
less fundamental than the determinant, is that if A is the bipartite adjacency matrix of a
bipartite graph with both vertex sets X and Y labelled with the integers {1, 2, . . . , n}, then
per(A) is the number of perfect matchings in that graph: that is, the number of bijections
σ : X → Y such that xσ(x) is an edge for every x ∈ X.
Given that one cannot easily calculate the permanent of a matrix, attention turns
naturally to finding bounds. Brégman’s theorem is a general upper bound for 01-matrices.

Theorem 7.14. Let A be an n × n 01-matrix and for each i let ri be the number of 1s in row i. Then

    per(A) ≤ Π_{i=1}^n (ri!)^{1/ri}.

Note a simple special case: if A consists entirely of 1s, then the right-hand side gives the nth power of (n!)^{1/n}, which equals n!, which is the number of perfect matchings in the complete bipartite graph K_{n,n}. So the bound is sharp in this case. Note also that if A is a matrix that splits into two diagonal blocks A1 and A2 (with 0s elsewhere), then per(A) = per(A1) per(A2), and if we write f(A) for the upper bound in Brégman's theorem, then we also have that f(A) = f(A1)f(A2). So this gives us a reasonably wide class of matrices for which the bound is sharp.
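For small matrices one can compute the permanent by brute force and compare it with Brégman's bound. The following Python sketch (my own illustration: the brute-force computation takes n! steps, so it is for tiny matrices only) does exactly that.

    from itertools import permutations
    from math import factorial, prod

    def permanent(A):
        # Brute force over all n! permutations.
        n = len(A)
        return sum(prod(A[i][sigma[i]] for i in range(n))
                   for sigma in permutations(range(n)))

    def bregman_bound(A):
        # Product over rows of (r_i!)^(1/r_i); empty rows contribute 1.
        bound = 1.0
        for row in A:
            r = sum(row)
            if r > 0:
                bound *= factorial(r) ** (1 / r)
        return bound

    A = [[1, 1, 0],
         [0, 1, 1],
         [1, 1, 1]]
    print(permanent(A), bregman_bound(A))  # 3 and about 3.63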
We now give a proof, due to Jaikumar Radhakrishnan, that uses entropy. The theorem
has a number of proofs, some quite short, so we do not claim that the use of entropy is
essential: nevertheless it provides an interesting illustration of the technique.

Entropy proof of the theorem. Let G be the bipartite graph with bipartite adjacency ma-
trix A: that is, the vertex sets of G are two copies X, Y of [n] and xy is an edge of G if and
only if Axy = 1. Let σ be a perfect matching chosen uniformly at random from all perfect
matchings in G. We shall obtain an upper bound for the entropy H[σ]. (To be clear, σ
is a random variable with values in Sn and H[σ] is the entropy of that random variable.)
Brégman's theorem is equivalent to the assertion that the number of perfect matchings is at most Π_{x∈X} (d(x)!)^{1/d(x)}, where d(x) is the degree of x, which in turn is equivalent to the inequality

    H[σ] ≤ Σ_{x∈X} (1/d(x)) log(d(x)!).

Let x1 , x2 , . . . , xn be an enumeration of the vertices of X. (We can’t just take the


obvious enumeration here, since later we shall average over all enumerations – this is a key
idea of the proof.) Then the chain rule (Lemma 7.7) tells us that

H[σ] = H[σ(x1 )] + H[σ(x2 )|σ(x1 )] + · · · + H[σ(xn )|σ(x1 ), . . . , σ(xn−1 )].

We now bound the quantities on the right-hand side, starting with H[σ(x1)]. We know that σ(x1) must be one of the d(x1) neighbours of x1 in G. The probability that any given neighbour is chosen depends on the matrix A in a horribly complicated way, but since σ(x1) can take only d(x1) different values, we can at least bound H[σ(x1)] above by the entropy of the uniform distribution on a set of size d(x1), which by Lemma 7.11 is log(d(x1)).
Now let us turn to H[σ(x2)|σ(x1)]. If we are given σ(x1), then we know two things about σ(x2): that it is a neighbour of x2, and that it is not equal to σ(x1). For each x and σ, let us define d^σ_1(x) to be the number of neighbours of x not equal to σ(x1). Then an upper bound for H[σ(x2)|σ(x1)] is E_σ log(d^σ_1(x2)).
More generally, for each k, if we are given σ(x1), . . . , σ(x_{k−1}), then for each x, let d^σ_{k−1}(x) be the number of neighbours of x not equal to any of σ(x1), . . . , σ(x_{k−1}). Then an upper bound for H[σ(x_k)|σ(x1), . . . , σ(x_{k−1})] is E_σ log(d^σ_{k−1}(x_k)), so

    H[σ] ≤ Σ_{k=1}^n E_σ log(d^σ_{k−1}(x_k)).

At this point we seem to be facing a serious difficulty, which is that it is hard to say anything about the distribution of d^σ_{k−1}, and hence hard to bound the expectations we need to bound. However, we still have a card up our sleeve, mentioned above, which is that we can average over all orderings x1, . . . , xn of [n].
Let σ be fixed, and let x1, . . . , xn be a random ordering. Let x be a vertex in X and let us think about the contribution x makes to the sum

    Σ_{k=1}^n log(d^σ_{k−1}(x_k)),

which is what we wish to estimate.

Let the neighbours of x be y1, . . . , yd. Then the ordering of the vertices σ^{-1}(y1), . . . , σ^{-1}(yd), which all belong to X, is distributed uniformly amongst all d! possible orderings, and the position of x (which is σ^{-1}(yj) for some j) in this ordering is therefore uniformly distributed amongst the d possibilities. If x comes in position s in the ordering, and k (which depends on the ordering) is such that x_k = x, then d^σ_{k−1}(x_k) = d − s + 1. Thus, for any fixed σ, d^σ_{k−1}(x_k) is uniformly distributed in the set {1, 2, . . . , d(x)}. It follows that

    E Σ_{k=1}^n log(d^σ_{k−1}(x_k)) = Σ_{x∈X} (1/d(x)) Σ_{s=1}^{d(x)} log(d(x) − s + 1) = Σ_{x∈X} (1/d(x)) log(d(x)!),

where the expectation is over all orderings.


The above computation was for a fixed σ. But since the answer is independent of σ, we may conclude that

    E E_σ Σ_{k=1}^n log(d^σ_{k−1}(x_k)) = Σ_{x∈X} (1/d(x)) log(d(x)!),

where again the first expectation is over all orderings.
Since E_σ Σ_{k=1}^n log(d^σ_{k−1}(x_k)) was an upper bound for H[σ] for every ordering x1, . . . , xn, the result follows.

7.5 Shearer’s entropy lemma


We begin this section by proving a further useful fact about entropy. When I first wrote
this section of the notes, I didn’t manage to find an axiomatic proof so resorted to the
formula. Subsequently, Chris West, a wonderful double bass player who read mathematics
at Cambridge and has been following this course on YouTube, notified me that he had found
an axiomatic proof. So here are two proofs, the first a boring one that uses the formula,
and the second Chris West’s much nicer axiomatic proof (which is almost certainly known,
but I haven’t found it anywhere online).

Lemma 7.15. Let X and Y be discrete random variables. Then H[X, Y ] ≤ H[X] + H[Y ].

Proof using the formula. Since H[X, Y ] = H[Y ] + H[X|Y ], by the additivity axiom, this
is equivalent to the assertion that H[X|Y ] ≤ H[X], which makes intuitive sense, since
knowing the value of Y shouldn’t increase the amount of information we obtain from X.
To prove it, let A be the set of values taken by X, let B be the set of values taken by Y, and write pa for P[X = a], qb for P[Y = b], and pa,b for P[X = a, Y = b]. Then

    H[X|Y] = Σ_b qb H[X|Y = b]
           = Σ_b qb Σ_a (pa,b/qb) log(qb/pa,b)
           = Σ_a pa Σ_b (pa,b/pa) log(qb/pa,b).

Now Σ_b pa,b/pa = 1 for each a, so by Jensen's inequality this last expression is at most

    Σ_a pa log(Σ_b (pa,b/pa)(qb/pa,b)) = Σ_a pa log(1/pa) = H[X].

Axiomatic proof. First consider the case where Y is uniformly distributed. Let notation
be as in the previous proof. Then
    H[Y|X] = Σ_{a∈A} pa H[Y|X = a]

by the definition of conditional entropy. For each a, the random variable [Y |X = a]


takes values in B, so by maximality has entropy at most H[Y ], from which it follows that
H[Y |X] ≤ H[Y ], which is equivalent to the assertion that H[X, Y ] ≤ H[X] + H[Y ], and
also to the assertion that H[X|Y ] ≤ H[X].
Now suppose that qb is rational for every b ∈ B. Then we can find a positive integer n
such that every qb is of the form mb /n for a non-negative integer mb . Now define a random
variable Z as follows. For each b, let Eb be a set of size mb , and let these sets be disjoint.
If Y = b, let Z be chosen uniformly from Eb , and let this be done in such a way that
[Z|Y = b] and [X|Y = b] are independent.
Then by this independence and by what we have already proved,

H[X|Y ] = H[X|Y, Z] = H[X, Y, Z] − H[Y, Z] = H[X, Z] − H[Z] = H[X|Z] ≤ H[X],

which implies that H[X, Y ] ≤ H[X]+H[Y ]. The general case now follows by continuity.
Although we shall not use it in this course, it is worth knowing a definition that arises
naturally from the above fact. The mutual information I[X; Y ] is defined to be H[X] +
H[Y ] − H[X, Y ]. Lemma 7.15 implies that the mutual information is non-negative. Other
simple observations are that if X = Y (or more generally X determines Y and vice versa)
then I[X; Y] = H[X], that if Y = f(X), then I[X; Y] = H[Y], and that if X and Y are independent, then I[X; Y] = 0. One can think of the mutual information as being the
amount of information that is “shared” by X and Y . The formula

H[X, Y ] = H[X] + H[Y ] − I[X; Y ],

which follows from the definition, is highly reminiscent of the first case

|A ∪ B| = |A| + |B| − |A ∩ B|

of the inclusion-exclusion formula.


Starting with Lemma 7.15 and applying induction in the obvious way, we deduce that
entropy has the following more general subadditivity property: if X1 , . . . , Xn are random
variables, then
H[X1 , . . . , Xn ] ≤ H[X1 ] + · · · + H[Xn ].
Shearer’s lemma, which has a number of applications, is a slightly less obvious generaliza-
tion of this fact.

Theorem 7.16 (Shearer’s lemma). Let S be a random subset of [n], chosen according to
some probability distribution, and suppose that for every i, P[i ∈ S] ≥ µ. Then for any
discrete random variables X1 , . . . , Xn ,

µH[X1 , . . . , Xn ] ≤ ES H[XS ],

where if S = {i1 , . . . , ik }, then XS is short for the random variable (Xi1 , . . . , Xik ).

Note that if S is a random singleton, chosen uniformly, then µ can be taken to be n^{-1}, and we conclude that n^{-1} H[X1, . . . , Xn] ≤ E_i H[X_i], which is just a restatement of the subadditivity just mentioned.
Proof of Theorem 7.16. As in the statement of the theorem, let us write S = {i1 , . . . , ik },
and let this be done in such a way that i1 < · · · < ik . (Note that S is a random set, so the
elements ij are not fixed elements of [n].)
By the chain rule for entropy (Lemma 7.7),

H[XS ] = H[Xi1 ] + H[Xi2 |Xi1 ] + · · · + H[Xik |Xi1 , . . . , Xik−1 ].

Write H[Xi |X<i ] for H[Xi |X1 , . . . , Xi−1 ]. Since conditioning on more variables reduces
entropy, the above equality implies that
    H[X_S] ≥ Σ_{i∈S} H[X_i|X_{<i}].

Taking expectations, we deduce that

    E_S H[X_S] ≥ E_S Σ_{i∈S} H[X_i|X_{<i}]
              = Σ_{i=1}^n P[i ∈ S] H[X_i|X_{<i}]
              ≥ µ Σ_{i=1}^n H[X_i|X_{<i}]
              = µ H[X1, . . . , Xn],

where the second inequality is by the main hypothesis of the theorem and the last equality
is the chain rule again.
We now give a representative application of Shearer's lemma. It gives a bound for the following question: if a graph has m edges, how many triangles can it have? It is natural to guess that the best construction will be a complete graph on n vertices, at least if we can find n such that \binom{n}{2} = m. In that case we have \binom{n}{2} edges and \binom{n}{3} triangles. These are roughly n^2/2 and n^3/6, which would suggest that the number of triangles is at most something along the lines of (2m)^{3/2}/6. This is in fact exactly the bound we shall obtain. But although the result is not unexpected, it is notable that for the proof we do not have to get our hands in the slightest bit dirty: there are no arguments, for instance, that “compress” the graph.

Theorem 7.17. Let G be a graph with m edges and t triangles. Then t ≤ (2m)^{3/2}/6.

Proof. Let T be a triangle chosen uniformly at random from G and let X = (X1, X2, X3) be a uniformly random ordering of its vertices. Then X is uniformly distributed on a set of size 6t, so H[X] = log(6t).
We now let S be a random pair of elements of {1, 2, 3} (with each of the three pairs chosen with probability 1/3). Then we can set µ = 2/3 in Shearer's lemma, from which we deduce that

    (2/3) H[X] ≤ E_S H[X_S].

It follows (by averaging again) that there is some pair S such that (2/3) H[X] ≤ H[X_S]. But the values taken by X_S are ordered pairs of vertices that form the end-points of edges, so X_S is supported on a set of size 2m. It follows that H[X_S] ≤ log(2m). We conclude that

    (2/3) log(6t) ≤ log(2m),

which is equivalent to the assertion of the theorem.
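A quick numerical sanity check (an illustrative sketch of mine, not part of the argument): count the triangles by brute force and compare with (2m)^{3/2}/6.

    from itertools import combinations

    def triangles_and_bound(vertices, edges):
        edge_set = {frozenset(e) for e in edges}
        t = sum(1 for u, v, w in combinations(vertices, 3)
                if {frozenset((u, v)), frozenset((u, w)),
                    frozenset((v, w))} <= edge_set)
        return t, (2 * len(edge_set)) ** 1.5 / 6

    # The complete graph K5: m = 10 edges and t = 10 triangles.
    vertices = range(5)
    print(triangles_and_bound(vertices, list(combinations(vertices, 2))))
    # (10, 14.9...), so the bound holds comfortably here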

Now for a somewhat more subtle use of Shearer's lemma. We shall use it to obtain a bound on the following natural problem. Let G be a family of graphs with vertex set [n]. We say that G is ∆-intersecting if for any two graphs G1, G2 ∈ G, the intersection G1 ∩ G2 contains a triangle. And now we ask the obvious question: how large can a ∆-intersecting family be?
As with intersecting families, there is also an obvious construction: just take all graphs that contain some fixed triangle. This gives us a family of size 2^{\binom{n}{2}}/8. In the other direction, if we just use the fact that any two graphs in a ∆-intersecting family have a non-empty intersection, we get an upper bound of 2^{\binom{n}{2}}/2. (You might ask whether we can do better by using the fact that the intersection of any two graphs in the family has size at least 3, but the set of graphs with at least (1/2)\binom{n}{2} + 2 edges has that property and its size is not that much smaller than 2^{\binom{n}{2}}/2.)
It turns out that the bound 2^{\binom{n}{2}}/8 is the best possible – this is a relatively recent result of David Ellis, Yuval Filmus and Ehud Friedgut – but for a long time the best known bound was 2^{\binom{n}{2}}/4, which we shall prove now, using Shearer's lemma. This result is due to Chung, Frankl, Graham, and Shearer himself.
Theorem 7.18. A ∆-intersecting family G of graphs with vertex set [n] has size at most 2^{\binom{n}{2}}/4.
Proof. As usual, we let X be an element of G chosen uniformly at random, and our aim will be to obtain an upper bound for H[X], which will translate directly into an upper bound for |G|. In the argument that follows, it may help to think of each element of G as a function from [n]^{(2)} to {0, 1} – the characteristic function of the graph – so we are thinking of X as a random variable (X_e)_{e∈[n]^{(2)}}, where X_e = 1 if e belongs to the graph and 0 otherwise.
To apply Shearer's lemma, we shall choose S as follows. We first pick a set R ⊂ [n] uniformly at random, and we let G_R be the graph that consists of a clique with vertex set R and a clique with vertex set [n] \ R. Note that every edge e has a probability 1/2 of being an edge of G_R. Therefore, if we define X_R to be the intersection X ∩ G_R, which corresponds to the random variable (X_e)_{e∈G_R}, then Shearer's lemma gives us that

    (1/2) H[X] ≤ E_R H[X_R].

Now we use the fact that G is ∆-intersecting. Since any two graphs in G intersect in at least a triangle, and since every triangle has two vertices on the same side of the partition (R, [n] \ R) and therefore intersects every graph G_R, we find that the restriction of G to any G_R is an intersecting family, and therefore has size at most 2^{|G_R|−1}, where |G_R| denotes the number of edges of G_R. Therefore, H[X_R] ≤ |G_R| − 1 for every R. This gives us that

    H[X] ≤ 2 E_R |G_R| − 2.

Since each element of [n]^{(2)} has a probability 1/2 of belonging to G_R, the right-hand side is equal to \binom{n}{2} − 2. Since X is uniform on G, H[X] = log |G|, so |G| ≤ 2^{\binom{n}{2}−2}, and this proves the result.

It’s worth reflecting a little on what Shearer’s lemma did for us in the above argument.
It was a little bit similar in spirit to the averaging argument that proved the Erdős-Ko-Rado
theorem, in that we showed that a suitably chosen “random part” of our ∆-intersecting
family was not too large, and then we exploited that to get a bound on the size of the whole
family. However, unlike in Katona’s argument, the “random part” was not the intersection
of the family with a random set (such as the set of intervals in a cyclic order) but rather the
projection of the family on to a random set of coordinates, where each coordinate had an
equal probability of belonging to the set. So Shearer’s lemma tells us that if we can find a
nicely balanced set of projections of a set system and can give good upper bounds for their
sizes, then we can get a good upper bound for the size of the whole family. This general
theme of bounding the size of a set in terms of the size of its projections is a significant
one. (See for example the Loomis-Whitney inequality and its generalizations.)

8 The polynomial method


It is not very easy to say exactly what the polynomial method is, but in broad terms it
is exploiting facts about zeros of polynomials to prove results that do not appear to have
anything to do with polynomials. Typically, one tries to prove that a small combinatorial
structure cannot exist by showing that if it did, then it would allow us to define a non-zero
polynomial of low degree that vanishes on a large set, sometimes with some extra structure,
which we can then contradict by proving results that show that non-zero polynomials of
low degree cannot vanish on such sets.

8.1 Dvir’s solution to the Kakeya problem for finite fields


In this section we shall illustrate the method with three results. The first is a celebrated result of Dvir, who solved the so-called Kakeya problem for finite fields. This is the following question: suppose that A is a subset of F_p^n that contains a translate of every line. Must A have size p^{n−o(1)}? Here, we are taking n to be fixed and p to be tending to infinity. Dvir's solution gave the following theorem.

Theorem 8.1. Let A be a subset of F_p^n that contains a translate of every line. Then |A| ≥ c(n)p^n.
The proof needs a couple of simple (but surprisingly powerful) lemmas.
Lemma 8.2. Let A ⊂ F_p^n be a set of size less than \binom{n+d}{d}. Then there exists a non-zero polynomial P(x1, . . . , xn) of degree at most d that vanishes on A.

Proof. A polynomial of degree at most d in the variables x1, . . . , xn is a linear combination of monomials of degree at most d. The number of such monomials is the number of ways of writing a number less than or equal to d in the form a1 + · · · + an with each ai a non-negative integer. By a standard holes-and-pegs argument, this is \binom{n+d}{d}. (Arrange n + d objects in a row, of which d are pegs and n are holes, and let ai be the number of pegs that immediately follow the ith hole; any pegs before the first hole account for the difference between a1 + · · · + an and d.) Therefore, for a polynomial of degree at most d to vanish at m given points, it is enough that a certain set of m linear combinations of the coefficients should all be zero. By dimension considerations, this system has a non-zero solution if m is less than the number of coefficients. The result follows.
When d = p − 1, this gives us that for every set A of size less than \binom{n+p−1}{p−1} = \binom{n+p−1}{n} we can find a non-zero polynomial of degree at most p − 1 that vanishes on A. Since \binom{n+p−1}{n} > p^n/n!, this is in particular true when |A| ≤ p^n/n!.
The next lemma is the main idea behind the proof of Dvir’s theorem.

Lemma 8.3. Suppose that A ⊂ F_p^n contains a line in every direction, that d < p, and that there exists a non-zero polynomial f of degree at most d that vanishes on A. Then there is a non-zero polynomial of degree at most d that vanishes everywhere on F_p^n.

Proof. Without loss of generality the degree of f is exactly d. Let a, z ∈ F_p^n with z ≠ 0 and let L be the line consisting of all points a + tz. The restriction of f to L is a polynomial of degree at most d in t, and its coefficient of t^d is f_d(z), where f_d is the degree-d part of f. To see this, observe that for any monomial Π_{i=1}^n x_i^{r_i} its value at a + tz is Π_{i=1}^n (a_i + tz_i)^{r_i}, so if the monomial has degree d, then to obtain a term in t^d we must choose tz_i from each bracket, which gives t^d Π_{i=1}^n z_i^{r_i}.
Now A contains a line in the direction z, and f vanishes on A, so f vanishes everywhere on that line. Since the dependence of f on t along the line is given by a polynomial of degree less than p that vanishes for all p values of t, all the coefficients of that polynomial must be zero. It follows that f_d(z) = 0. But z was an arbitrary non-zero element of F_p^n, and f_d vanishes at zero as well (being homogeneous of positive degree), so it vanishes everywhere.
To finish the proof, we need the second of our simple but powerful lemmas. Note that there is an important distinction between the zero polynomial, which is the polynomial with all coefficients of all monomials equal to zero, and a polynomial that takes the value zero everywhere on F_p^n. For instance, the polynomial x^p − x is not the zero polynomial, but it takes the value zero everywhere on F_p.

Lemma 8.4. Let f be a non-zero polynomial on F_p^n of degree less than p. Then f does not take the value zero at every point of F_p^n.

Proof. We shall prove this by induction on n. If n = 1, then a non-zero polynomial that vanishes everywhere has p roots, so must have degree at least p. Essentially the same argument works for general n. If f vanishes everywhere, then for each a it vanishes on the set x1 = a. But the restriction of f to that set is a polynomial of degree less than p in the variables x2, . . . , xn, so by induction it is the zero polynomial.
In other words, if we substitute a for x1 in the definition of f, we obtain the zero polynomial. If we think of f as a polynomial in x1 with coefficients in F_p[x2, . . . , xn], then the polynomial division algorithm tells us that we can write f(x) in the form P(x1, . . . , xn)(x1 − a) + Q(x2, . . . , xn), so if this polynomial is the zero polynomial when we substitute x1 = a, we obtain that Q is the zero polynomial and (x1 − a) divides f. But if this is true for every a, then again we find that f has to have degree at least p, contradicting our assumption.

Note that this result is sharp, as can be seen by considering the polynomial x_1^p − x_1.

Proof of Theorem 8.1. Combine Lemmas 8.2, 8.3 and 8.4: if |A| ≤ p^n/n!, then by Lemma 8.2 there is a non-zero polynomial of degree at most p − 1 that vanishes on A, so by Lemma 8.3 there is a non-zero polynomial of degree less than p that vanishes everywhere on F_p^n, contradicting Lemma 8.4. So |A| > p^n/n!, and the result follows with c(n) = 1/n!.
Since we are most of the way to a proof, we briefly discuss a result called the Schwartz-Zippel lemma, which tells us that a polynomial of degree d in n variables cannot have too many roots. Some sort of intuition for how many roots we might expect comes from thinking about linear polynomials: there we do not get more than p^{n−1} roots. A very useful and general principle in applications of arithmetic geometry is that polynomials behave in a rather similar way to linear functions. For example, a linear function from R to R has at most one root, while a polynomial of degree d has at most d roots, which is the same up to a constant.

Lemma 8.5. Let f be a non-zero polynomial of degree at most d on F_p^n. Then f has at most dp^{n−1} roots.

Proof. We may assume that d < p, since otherwise the bound is trivial. Without loss of generality the degree is exactly d. As we have already seen, if we restrict f to the line consisting of points a + tz, then we obtain a polynomial of degree at most d in the single variable t whose coefficient of t^d is f_d(z), where f_d is the degree-d part of f. By Lemma 8.4 we may choose z ≠ 0 such that f_d(z) ≠ 0. But that means that on every line L in direction z the restriction of f to L is given by a non-zero polynomial in one variable of degree d. So f has at most d roots in any such line, and since the p^{n−1} lines in direction z partition F_p^n, f has at most dp^{n−1} roots in total.
The bound here is sharp as well, as can be seen by considering polynomials that depend on x1 only. Note also that the Schwartz-Zippel lemma implies Lemma 8.4.
One use of the Schwartz-Zippel lemma appears in theoretical computer science in connection with the general problem of polynomial identity testing. Here one is given two polynomials P and Q in n variables over a finite field F_p, and the task is to decide whether they are in fact the same polynomial. If the polynomials are presented as sums of products of other polynomials, then it may be impractical to determine this by expanding everything out as a sum of monomials, of which there can be around n^d if the polynomials have degree d, and in general the problem appears to be very hard.
However, the Schwartz-Zippel lemma provides a probabilistic algorithm. One selects r points (x1, . . . , xn) independently at random and checks for each one whether P(x1, . . . , xn) = Q(x1, . . . , xn). If ever they differ, we know that the polynomials are not identical. Conversely, if P and Q have degree d and are different, then by the Schwartz-Zippel lemma the probability that we fail to detect this with all r of our random choices is at most (d/p)^r, since for each choice to fail we have to pick a root of P − Q, and the probability of choosing a root is at most d/p. So by making r large (but not ridiculously large) we can achieve an extremely high degree of confidence (falling short of total certainty) that either the two polynomials are the same or we will detect that they are different.
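Here is what that algorithm looks like in Python (an illustrative sketch: the helper name and the example polynomials are my own, and the polynomials are treated as black-box functions).

    import random

    def probably_equal(P, Q, p, n, d, trials=50):
        # If P != Q as polynomials of degree <= d, a single random point
        # detects the difference with probability at least 1 - d/p, so the
        # overall failure probability is at most (d/p)^trials.
        for _ in range(trials):
            x = tuple(random.randrange(p) for _ in range(n))
            if P(x) % p != Q(x) % p:
                return False  # definitely different polynomials
        return True  # equal, except with probability at most (d/p)^trials

    p = 101
    P = lambda v: (v[0] + v[1]) ** 2
    Q = lambda v: v[0] ** 2 + 2 * v[0] * v[1] + v[1] ** 2
    print(probably_equal(P, Q, p, n=2, d=2))  # True: these agree identically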
8.2 Alon’s combinatorial Nullstellensatz
The next result, due to Noga Alon, concerns polynomials, but has a number of applications
to problems that do not appear to have anything to do with polynomials.
Lemma 8.6. Let f be a non-zero polynomial in n variables over F_p of degree k_1 + · · · + k_n, where the k_i are non-negative integers and the coefficient of x_1^{k_1} · · · x_n^{k_n} is non-zero. Let S_1, ..., S_n be subsets of F_p such that |S_i| > k_i for each i. Then f does not vanish on S_1 × · · · × S_n.
Proof. We prove this by induction on the degree of f . The result is easy when the degree
is zero.
If the degree is greater than zero, then without loss of generality k_1 > 0. Let a ∈ S_1 and use polynomial division to write f(x) in the form (x_1 − a)P(x) + Q(x), where Q does not depend on x_1. Since the term in x_1^{k_1} · · · x_n^{k_n} has a non-zero coefficient in f, and the degree of P is k_1 + · · · + k_n − 1, the term in x_1^{k_1−1}x_2^{k_2} · · · x_n^{k_n} has a non-zero coefficient in P.
Suppose that the conclusion is false. Then f vanishes on {a} × S_2 × · · · × S_n, from which it follows that Q vanishes on this set too. Since Q does not depend on x_1, it vanishes on all of S_1 × · · · × S_n. Therefore, (x_1 − a)P vanishes on S_1 × · · · × S_n, which implies that P vanishes on (S_1 \ {a}) × S_2 × · · · × S_n. Since |S_1 \ {a}| > k_1 − 1, this contradicts the inductive hypothesis applied to P, and we are done.
As our first application of the combinatorial Nullstellensatz, we give a short proof of
the Cauchy-Davenport theorem, which is the following result (discovered independently by
Cauchy in 1813 and Davenport in 1935 – apparently it was not until 1947 that Davenport
found out that Cauchy had beaten him to it by over a century). Neither Cauchy nor
Davenport used the method below.
Theorem 8.7. Let p be a prime and let A and B be subsets of F_p. Then |A + B| ≥ min{p, |A| + |B| − 1}.
Proof. Once we have the clue that this can be proved using the combinatorial Nullstellen-
satz, we need to find a suitable sequence of sets S1 , . . . , Sn and a polynomial that vanishes
on S1 × · · · × Sn and that has small degree unless A + B is large.
This gives enough of a clue to complete the proof. The obvious sequence of sets to take,
since the only sets we have around are A and B, is (A, B). We want the degree of our
polynomial to depend on the size of A + B, and we also want it to vanish on A × B. The
most economical way of getting it to vanish at a point (a, b) is to ensure that the polynomial
has a factor x + y − (a + b), which leads to the idea of considering the polynomial
f(x, y) = \prod_{c ∈ A+B} (x + y − c).

This vanishes on A × B and has degree equal to |A + B|, so it looks promising.
Suppose now that |A| + |B| ≤ p and |A + B| ≤ |A| + |B| − 2. We want to contradict the combinatorial Nullstellensatz, so we need a monomial x^r y^s with non-zero coefficient with r < |A|, s < |B| and r + s = |A + B|. But if we pick any r and s that satisfy the last three conditions, which we clearly can if |A + B| ≤ |A| + |B| − 2, then the coefficient of x^r y^s in the polynomial is \binom{r+s}{r}, and this is non-zero because p is prime and r + s < p.
If |A| + |B| > p, then for any x ∈ F_p the sets A and x − B intersect, which gives a ∈ A and b ∈ B with a + b = x; so A + B = F_p and |A + B| = p.
Note that this result is sharp if A and B are arithmetic progressions in F_p with the same common difference. Note too that the result is false in Z_n if n is composite, since then Z_n has proper non-trivial subgroups: if H is such a subgroup and A = B = H, then A + B = H, which is smaller than min{n, 2|H| − 1}.
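For a small prime the theorem can be checked exhaustively, and the failure for composite moduli is easy to exhibit. The following Python sketch is purely illustrative (no such code appears in the notes):

```python
from itertools import combinations

def sumset(A, B, n):
    # the set {a + b : a in A, b in B} inside Z_n
    return {(a + b) % n for a in A for b in B}

p = 7
holds = all(
    len(sumset(A, B, p)) >= min(p, len(A) + len(B) - 1)
    for i in range(1, p + 1)
    for j in range(1, p + 1)
    for A in combinations(range(p), i)
    for B in combinations(range(p), j)
)
print(holds)  # True: the bound holds for every pair of subsets of F_7

# In Z_6 the bound fails: for the subgroup A = B = {0, 3} we get
# |A + B| = 2, whereas min(6, 2 + 2 - 1) = 3.
print(len(sumset({0, 3}, {0, 3}, 6)))  # 2
```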
Now we shall use the combinatorial Nullstellensatz to prove a variant of the Cauchy-Davenport theorem. The result is due to da Silva and Hamidoune, who found a somewhat involved combinatorial proof. This short argument was discovered by Alon, Nathanson and Ruzsa. Let us write A+̇B for the set of sums a + b such that a ∈ A, b ∈ B and a ≠ b.
Theorem 8.8. Let p be a prime and let A and B be subsets of Z_p. Then |A+̇B| ≥ min{p, |A| + |B| − 3}.
Proof. First we show that if |A| + |B| ≥ p + 2, then A+̇B = Z_p. If p = 2 the result is trivial. To see it for p > 2, let x ∈ Z_p. Then A ∩ (x − B) contains at least two elements, so at least one of them, a say, does not satisfy the equation 2a = x. But then a ∈ A and b = x − a ∈ B are distinct elements that add up to x.
We would now like to apply the combinatorial Nullstellensatz, so we need to show that
if A+̇B is too small, then some polynomial of low degree vanishes everywhere on a product
set. The natural product set to try to take is A × B. What low-degree polynomial would
vanish on A × B? Well, we know that A+̇B is small, so a first approximation would be to take the polynomial \prod_{c ∈ A+̇B} (x + y − c). The trouble with that is that it does not necessarily vanish when x = y ∈ A ∩ B. But we can take care of that by simply multiplying by the polynomial x − y.
So now we have a polynomial of degree |A+̇B|+1 that vanishes on A×B. For technical
reasons it will be convenient to modify it slightly. Let us assume that the result is false
and let C be a set that contains A+̇B and has cardinality exactly |A| + |B| − 4. (Such a set exists inside Z_p, since by the first paragraph we may assume that |A| + |B| ≤ p + 1.) Let P be the polynomial (x − y) \prod_{c ∈ C} (x + y − c). This polynomial vanishes on A × B and has degree |A| + |B| − 3.
Let us look at the coefficients of x^{r−1}y^{s−2} and x^{r−2}y^{s−1}, where we write r = |A|, s = |B| and t = r + s − 3, since if either of these is non-zero then we will have contradicted the combinatorial Nullstellensatz. The top-degree part of P is (x − y)(x + y)^{t−1}, so the coefficient of x^{r−1}y^{s−2} is \binom{t−1}{r−2} − \binom{t−1}{r−1} and the coefficient of x^{r−2}y^{s−1} is \binom{t−1}{r−3} − \binom{t−1}{r−2}.
Now it is not possible for three consecutive binomial coefficients \binom{a}{b−1}, \binom{a}{b} and \binom{a}{b+1} with a < p to be congruent mod p. This can be checked by hand if p = 2. Otherwise, note first that since a < p, every \binom{a}{b} with 0 ≤ b ≤ a is non-zero mod p (so the boundary cases, where b − 1 < 0 or b + 1 > a and one coefficient is zero, are trivial). Then \binom{a}{b} = \frac{a−b+1}{b}\binom{a}{b−1} and \binom{a}{b+1} = \frac{a−b}{b+1}\binom{a}{b}, so we would require a ≡ 2b − 1 and a ≡ 2b + 1 mod p, which forces 2 ≡ 0 mod p, a contradiction. Here a = t − 1 = r + s − 4 < p, since we may assume that |A| + |B| ≤ p + 1. It follows that the two coefficients above are not both zero, so the result is proved.
Note that if A = B = {0, 1, 2, ..., r − 1}, then A+̇B = {1, 2, ..., 2r − 3}, so the result is sharp when the sets have equal size. It is not sharp for general sizes, since for example if B is a singleton {b}, then |A+̇B| ≥ |A| − 1 = |A| + |B| − 2.
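The same kind of exhaustive verification works for the restricted sumset (again an illustrative sketch, not part of the notes):

```python
from itertools import combinations

def restricted_sumset(A, B, p):
    # the set {a + b : a in A, b in B, a != b} inside Z_p
    return {(a + b) % p for a in A for b in B if a != b}

p = 7
holds = all(
    len(restricted_sumset(A, B, p)) >= min(p, len(A) + len(B) - 3)
    for i in range(1, p + 1)
    for j in range(1, p + 1)
    for A in combinations(range(p), i)
    for B in combinations(range(p), j)
)
print(holds)  # True: the bound holds for every pair of subsets of Z_7
```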
The above two results can be proved by other means, but there are results that have
been proved using the combinatorial Nullstellensatz for which no other proof is known –
just as no proof of Dvir’s theorem is known that does not go via polynomials.
It is interesting to note that there is an important difference between Dvir’s proof and
proofs that use the combinatorial Nullstellensatz. The former uses the fact that a non-
zero polynomial of low degree cannot have too many roots. That is, the zero set of a
low-degree non-zero polynomial cannot be too large. The combinatorial Nullstellensatz
places a restriction (under a slightly different hypothesis) on not just the size but also the
structure of the zero set: it cannot be a Cartesian product of sets that are too large.
8.3 The solution to the cap-set problem
The following result is a well-known theorem in additive combinatorics.
Theorem 8.9. For every δ > 0 there exists n such that every subset A ⊂ F_3^n of density at least δ contains three distinct vectors x, y, z with x + y + z = 0.

Note that the condition x + y + z = 0 is the same as the condition x + z = 2y, so we


can think of such triples as arithmetic progressions in Fn3 . We can also think of them as
one-dimensional affine subspaces.
The theorem is due to Roy Meshulam, though the proof is a straightforward adaptation
of a proof due to Roth of the following closely related result.
Theorem 8.10. For every δ > 0 there exists n such that every subset A ⊂ {1, 2, . . . , n} of
density at least δ contains an arithmetic progression of length 3.
It is natural to ask for more quantitative versions of these results. Meshulam's proof tells us that there is an absolute constant C such that the conclusion of the theorem holds if n ≥ Cδ^{−1}. Turning that round, this says that if A ⊂ F_3^n is a set of density at least Cn^{−1}, then the conclusion holds. However, the densest known sets that did not contain triples x, y, z with x + y + z = 0 were far sparser than this – they had exponentially small density (as a function of n).
Bridging this huge gap became a notorious open problem, known as the cap-set problem,
until a remarkable development in 2016. First Ernie Croot, Seva Lev, and Péter Pál Pach
proved that a subset of Z_4^n with no progression of length 3 has to be of exponentially
small density, and then within a couple of weeks Jordan Ellenberg and Dion Gijswijt
independently showed that the Croot-Lev-Pach technique could be made to solve the cap-
set problem itself. Not long after that, Terence Tao came up with a related argument
that, unlike the arguments of Ellenberg and Gijswijt, treated the three variables x, y, z
symmetrically. We give Tao’s argument here.
The argument begins with a notion of rank for 3-tensors (that is, 3-dimensional generalizations of matrices). First, note that if X, Y are finite sets, F is a field, and f : X × Y → F is a function of two variables, then the rank of f, if we regard it as a matrix, is the minimum k such that we can find functions u_1, ..., u_k : X → F and v_1, ..., v_k : Y → F such that

f(x, y) = \sum_{i=1}^{k} u_i(x) v_i(y)

for every (x, y) ∈ X × Y.
If we now look at functions of three variables, there are a number of natural generalizations of rank. Let f : X × Y × Z → F be a function. Then the most obvious way that we might try to decompose f into simpler parts is by writing it in the form

f(x, y, z) = \sum_{i=1}^{k} u_i(x) v_i(y) w_i(z).

Another, slightly less obvious but very natural when one stops to think about it, is to write it in the form

f(x, y, z) = \sum_{i=1}^{k} u_i(x, y) v_i(y, z) w_i(x, z).
Both the above definitions have their uses, but Tao introduced an “intermediate” defini-
tion, where instead of using products of one-variable functions or products of two-variable
functions, one takes products of a one-variable function and a two-variable function.
Definition 8.11. Let X, Y and Z be finite sets, let F be a field, and let f : X × Y × Z → F. The slice rank of f is the smallest r such that there exist integers 0 ≤ r_1 ≤ r_2 ≤ r and functions u_1, ..., u_r and v_1, ..., v_r such that

f(x, y, z) = \sum_{i=1}^{r_1} u_i(x) v_i(y, z) + \sum_{i=r_1+1}^{r_2} u_i(y) v_i(x, z) + \sum_{i=r_2+1}^{r} u_i(z) v_i(x, y)

for every (x, y, z) ∈ X × Y × Z.
In the above definition, the functions u_i and v_i do not all have the same domains – for instance, if r_1 < i ≤ r_2 then u_i : Y → F and v_i : X × Z → F, as is suggested by the use of the variables. Note also that we use the convention that if r_1 = 0 then there are no terms in the first sum, if r_1 = r_2 then there are no terms in the second sum, and if r_2 = r then there are no terms in the third sum.
To put the above definition in a more wordy way, we could define a function from
X × Y × Z to be “basic” if it is a product of a one-variable function and a two-variable
function. Then the slice rank of f is the smallest r such that f is a sum of r basic functions.
Now let us prove a lemma that generalizes to slice ranks the fact that the rank of a
diagonal matrix is the number of non-zero entries of that matrix.
Lemma 8.12. Let X be a finite set, let A ⊂ X, let F be a field, and let f : X^3 → F be a function such that f(x, y, z) ≠ 0 if and only if x = y = z and x ∈ A. Then the slice rank of f is |A|.
Proof. Let δ_a : X → F be defined by δ_a(x) = 1 if x = a and 0 otherwise. Then by hypothesis we have that

f(x, y, z) = \sum_{a ∈ A} f(a, a, a) δ_a(x) δ_a(y) δ_a(z)

for every (x, y, z) ∈ X^3, which implies that f has slice rank at most |A| (since a product of two one-variable functions is a two-variable function).
Now let us suppose that we have a decomposition

f(x, y, z) = \sum_{i=1}^{r_1} u_i(x) v_i(y, z) + \sum_{i=r_1+1}^{r_2} u_i(y) v_i(x, z) + \sum_{i=r_2+1}^{r} u_i(z) v_i(x, y).
Without loss of generality r_1 > 0, so that the first sum above is not empty.
We now claim that there is a function h : X → F such that \sum_{x ∈ X} h(x) u_i(x) = 0 for i = 1, 2, ..., r_1 and such that h is non-zero outside a set of size at most r_1. To see this, form an r_1 × |X| matrix M with the functions u_i as its rows (that is, M(i, x) = u_i(x)). Then we are trying to find a solution to the equation Mh = 0 such that h does not have many zeros. We can put M into reduced row-echelon form, and then it has s ≤ r_1 non-zero rows, and a set of s columns such that those rows and columns form a copy of the s × s identity matrix. Let S ⊂ X be the subset corresponding to that set of columns. Then we may choose arbitrary non-zero values h(x) for every x ∉ S, and the values of h inside S are then uniquely determined by the equations. The claim follows.
Now consider the function g : X × X → F given by the formula

g(y, z) = \sum_{x ∈ X} h(x) f(x, y, z).
Then g(y, z) = 0 if y ≠ z or if y = z ∉ A, while if y = z = a ∈ A then it takes the value h(a)f(a, a, a). Since h is non-zero outside a set of size at most r_1, h(a)f(a, a, a) is non-zero for at least |A| − r_1 values of a ∈ A. Thus, regarded as a matrix, g is diagonal with at least |A| − r_1 non-zero entries, from which it follows that g has rank at least |A| − r_1.
However, we also know that g(y, z) is given by the formula

\sum_{i=r_1+1}^{r_2} u_i(y) \sum_{x ∈ X} h(x) v_i(x, z) + \sum_{i=r_2+1}^{r} u_i(z) \sum_{x ∈ X} h(x) v_i(x, y),

which is a sum of r − r_1 products of two one-variable functions. This proves that g has rank at most r − r_1.
It follows that |A| ≤ r, so the slice rank is at least |A|.
The connection between slice rank and the cap-set problem comes with the following
key observation. Suppose that A is a subset of F_3^n that does not contain distinct elements x, y, z with x + y + z = 0. Then if x, y, z belong to A and are not all the same, we must
have that x + y + z ≠ 0 and hence that there exists i such that x_i + y_i + z_i ≠ 0. Since every non-zero element t of F_3 satisfies t^2 = 1, this last assertion can be written in polynomial form as

1 − (x_i + y_i + z_i)^2 = 0.
Now let f : A^3 → F_3 be defined by setting f(x, y, z) = 1 if x = y = z ∈ A and 0 otherwise. Then

f(x, y, z) = \prod_{i=1}^{n} (1 − (x_i + y_i + z_i)^2)

for every (x, y, z) ∈ A^3. By the above lemma, the slice rank of f is |A|, so any upper bound we can obtain for the slice rank will translate directly into an upper bound for the size of A.
Lemma 8.13. The slice rank of the polynomial P(x, y, z) = \prod_{i=1}^{n} (1 − (x_i + y_i + z_i)^2) is at most 3M, where M is the number of 012-sequences of length n that sum to at most 2n/3.
Proof. The polynomial P is a polynomial of degree 2n in the 3n variables x_1, ..., x_n, y_1, ..., y_n, and z_1, ..., z_n, and it is of degree at most 2 in each variable. It can be written as a linear combination of monic monomials, that is, polynomials of the form

x_1^{a_1} · · · x_n^{a_n} y_1^{b_1} · · · y_n^{b_n} z_1^{c_1} · · · z_n^{c_n}.

Let us now partition the monomials according to which of a_1 + · · · + a_n, b_1 + · · · + b_n and c_1 + · · · + c_n is the smallest (breaking ties arbitrarily when they occur). Since each monomial has total degree at most 2n, the smallest of the three sums is at most 2n/3, so this expresses P as a sum P_1 + P_2 + P_3, where P_1 is of degree at most 2n/3 in the x_i, P_2 is of degree at most 2n/3 in the y_i, and P_3 is of degree at most 2n/3 in the z_i.
Now P_1 is a sum of the form \sum_j Q_j(x) R_j(y) S_j(z), where Q_j, R_j and S_j are monomials and Q_j is monic. Let us now collect these terms according to what Q_j is. That enables us to write P_1 as a sum \sum_h T_h(x) U_h(y, z), where the T_h are distinct monic monomials of degree at most 2n/3. (Each T_h is equal to a Q_j and each U_h(y, z) is the sum of all the R_j(y) S_j(z) such that Q_j = T_h.) Each T_h is of the form x_1^{a_1} · · · x_n^{a_n} with every a_i ∈ {0, 1, 2} and a_1 + · · · + a_n ≤ 2n/3, so the number of possibilities for T_h is M, and there are at most M terms in the sum.
We can do the same with P_2 and P_3, and this proves the result.
It remains only to obtain an upper bound for M. For this we can use Lemma 2.6. Let X_1, ..., X_n be independent random variables, each of which is uniformly distributed in the set {−1, 0, 1}, and let X = X_1 + · · · + X_n. Then it is not hard to see (by considering Y_i = 1 − X_i, which is uniformly distributed in {0, 1, 2}) that M = 3^n P[X ≥ n/3], which by Lemma 2.6 is at most 3^n e^{−n/36}, and in particular is exponentially smaller than 3^n. It follows that |A| is also exponentially smaller than 3^n, and we have proved Theorem 8.9 with an exponential bound for the density required.
One can of course be more careful at the end here and obtain a precise estimate for the
number of 012-sequences that add up to at most 2n/3.
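Indeed, M is easy to compute exactly, since the number of 012-sequences of length n with sum exactly k is the coefficient of x^k in (1 + x + x^2)^n. The following illustrative Python sketch (not from the notes) computes M and the per-coordinate rate M^{1/n}, which creeps up towards a constant of about 2.755, the constant that appears in the Ellenberg-Gijswijt bound.

```python
def count_M(n):
    # coefficients of (1 + x + x^2)^n, built up by repeated multiplication;
    # the coefficient of x^k counts 012-sequences of length n with sum k
    coeffs = [1]
    for _ in range(n):
        new = [0] * (len(coeffs) + 2)
        for i, c in enumerate(coeffs):
            new[i] += c
            new[i + 1] += c
            new[i + 2] += c
        coeffs = new
    return sum(coeffs[: 2 * n // 3 + 1])

for n in [30, 60, 90, 120]:
    print(n, count_M(n) ** (1 / n))  # approaches a constant below 3
```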
9 Huang’s solution to the sensitivity conjecture
Just over a year ago (at the time of writing in late 2020), a young mathematician called
Hao Huang astonished theoretical computer scientists by not merely solving a famous
open problem, known as the sensitivity conjecture, but by doing so with an extraordinarily
short proof. As Scott Aaronson put it on his blog, it is a good illustration of the difference
between the complexity classes P and NP, because the proof is extremely short and easy
to understand, but it took a lot of very good mathematicians a few decades to find it.
I shall not say what the sensitivity conjecture is here, because it had been shown to
follow from a very simply stated problem about the discrete cube, so it is the latter problem
and its solution that I shall discuss. Write Q_n for the Hamming cube, that is, the graph with vertex set {0, 1}^n where we join x to y if they differ in exactly one coordinate. The
theorem that Huang proved is the following.
Theorem 9.1. Let G be an induced subgraph of Q_n with 2^{n−1} + 1 vertices. Then the maximum degree of G is at least √n.
Note that Q_n is bipartite, since if two vertices are joined, then the numbers of 1s in the corresponding sequences have different parities. So the theorem says that if we go from 2^{n−1} vertices to 2^{n−1} + 1 vertices, then the smallest possible maximum degree jumps from zero to √n.
To begin the proof, we define a sequence of symmetric matrices with mutually orthogonal rows, the inductive construction of which is reminiscent of that of the Walsh matrices we saw earlier in the course. We start with A_0 = (0), and then we define

A_n = \begin{pmatrix} A_{n−1} & I \\ I & −A_{n−1} \end{pmatrix}.

Here, I is the 2^{n−1} × 2^{n−1} identity matrix.
It is easy to see (by induction) that A_n is symmetric. We also have that its rows are orthogonal. To see this, note that it is trivial by induction if the two rows are either both in the top half or both in the bottom half. Otherwise, if we take the ith row from the top half and the jth row from the bottom half, we obtain the inner product (A_{n−1})_{ij} − (A_{n−1})_{ji}, which is zero because A_{n−1} is symmetric.
Another feature of the matrix A_n is that its entries are all −1, 0 or 1 and that there are n non-zero entries in each row (and hence also in each column). In fact, we can say more. If we index the rows and columns not with elements of the set {1, 2, ..., 2^n} but instead with 01-sequences of length n, where the first half of the indices for A_n are obtained from those of A_{n−1} by adding a 0 on the end, and the second half are obtained by adding a 1 on the end, then (A_n)_{xy} is non-zero only if x and y differ in exactly one coordinate. Indeed, if x and y share their last coordinate, then this is true by induction, and if they do not share it, then (A_n)_{xy} is an entry of one of the two identity-matrix parts of A_n, which is non-zero only when x and y are equal in all the other coordinates.
Thus, A_n is obtained from the adjacency matrix of Q_n by attaching signs to the edges. We can describe these signs inductively by saying that when x_n = 0 we use the signs from A_{n−1}, when x_n = 1 we multiply all those signs by −1, and for the edges that go across from x_n = 0 to x_n = 1 we have 1s everywhere. For a non-inductive description (which we will not need), if x and y agree except at the ith coordinate, then A_{xy} = (−1)^{x_{i+1} + · · · + x_n}.
An immediate consequence of these observations is that A_n^2 = nI. (The orthogonality of the rows, each of which has exactly n non-zero entries equal to ±1, implies that A_nA_n^T = nI, and the symmetry implies that A_n = A_n^T.) From this it follows that every eigenvalue of A_n is either √n or −√n. Since tr A_n = 0, the two eigenvalues each occur with multiplicity 2^{n−1}.
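All of these properties are easy to check numerically. Here is a quick sketch using numpy (the code, like the other snippets in this document, is an illustration and not something from the notes):

```python
import numpy as np

def huang_matrix(n):
    A = np.zeros((1, 1))  # A_0 = (0)
    for k in range(n):
        I = np.eye(2 ** k)
        A = np.block([[A, I], [I, -A]])  # the inductive construction
    return A

n = 5
A = huang_matrix(n)
assert np.allclose(A, A.T)                     # A_n is symmetric
assert np.allclose(A @ A, n * np.eye(2 ** n))  # A_n^2 = nI
eigs = np.linalg.eigvalsh(A)
print(np.sum(eigs < 0), np.sum(eigs > 0))      # 16 16: multiplicity 2^(n-1) each
```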
We are now ready to prove the theorem.
Proof of Theorem 9.1. Let V be a subset of {0, 1}^n of size 2^{n−1} + 1. Let us define a linear map α : R^V → R^V by taking a function f : V → R to the function g : V → R defined by

g(v) = \sum_{w ∈ V} (A_n)_{vw} f(w).

In other words, we extend f by zero to the whole of {0, 1}^n, multiply it (considered as a column vector) by the matrix A_n, and restrict the result to V.
If ∆ is the maximum degree of the subgraph of Q_n induced by V, then

‖αf‖_1 ≤ \sum_{v ∈ V} \sum_{w ∈ V} |(A_n)_{vw}||f(w)| = \sum_{w ∈ V} |f(w)| \sum_{v ∈ V} |(A_n)_{vw}| ≤ ∆‖f‖_1,

since |(A_n)_{vw}| = 1 if vw is an edge and 0 otherwise.
Now the set of all functions from {0, 1}^n to R that are supported in V is a space of dimension 2^{n−1} + 1, and therefore, since the two eigenspaces of A_n have dimension 2^{n−1}, there is an eigenvector f with eigenvalue √n supported in V. For this f we have that αf = √n f, and therefore that ‖αf‖_1 = √n‖f‖_1. It follows that ∆ ≥ √n.
How should one react to an argument like this, which at first sight appears to work
by magic? I recommend thinking about it until it starts to seem more natural and less
magical. Note, for instance, that it is another example of a very useful general technique,
which is to bound a cardinality in terms of something else. An earlier example has been
our use of entropy to provide bounds for cardinalities. Here one can imagine a thought
process that begins with the well-known observation that the largest eigenvalue of the
adjacency matrix of a graph is at most the maximum degree of the graph, then spots that
the same is true for any matrix with entries that are bounded in magnitude by those of the
adjacency matrix. So it is sufficient to find a matrix with that property and an eigenvector with eigenvalue √n. But we don't know all that much about the graph, since its vertex set is an arbitrary subset of {0, 1}^n of size 2^{n−1} + 1. That does tell us that we'd be done if we could find a matrix with eigenvalue √n of multiplicity at least 2^{n−1}. And although
it’s not trivial that such a matrix exists, once one starts to think about the properties it
ought to have, one arrives reasonably quickly at the wish that its non-zero entries should
all be ±1 and that the rows ought to be orthogonal, and actually finding it becomes a not
too difficult exercise at that point.
Actually, I know for a fact that that is not quite Huang’s thought process, since he used
a result called Cauchy’s interlace theorem, though the basic ideas were the same. The even
simpler proof presented above was spotted by Shalev Ben-David and shared in a comment
on Scott Aaronson’s blog post.
10 Dimension arguments
The topic of this last section wraps up the course quite nicely, as it will require us to
touch on a number of earlier topics that we have covered, including averaging arguments,
set systems with restrictions on intersections, combinatorial geometry, the Borsuk-Ulam
theorem, and use of polynomials.
The basic idea behind a dimension argument is that one can sometimes obtain an upper bound for the size of a combinatorial object by constructing an injection from that object to a low-dimensional vector space such that the image is a linearly independent set. Then the dimension of the vector space gives the desired upper bound. This technique is
particularly useful for extremal problems where the maximum is far from unique, since it
transforms those problems into other problems (of the form, “How large an independent set
can we find in this vector space?”) that have the same property. We shall use dimension
arguments to prove a number of results about set systems, and then we shall discuss a
famous solution by Jeff Kahn and Gil Kalai to a problem in combinatorial geometry.
10.1 Subsets of R^n that give rise to at most two distances
Let X be a subset of R^n and suppose that all the distances between pairs of distinct points of X are equal. How large can X be?
This is easy to answer. The condition tells us that X is a regular simplex, and the largest regular simplex that fits into R^n is the n-dimensional simplex, which has n + 1 vertices. (To make this argument rigorous, let X = {x_1, ..., x_m}, let x = (x_1 + · · · + x_m)/m, and let y_i = (x_i − x)/‖x_i − x‖ for every i. Then one can check that ⟨y_i, y_j⟩ = −1/(m − 1) for every i ≠ j, and then results about well-separated vectors from earlier in the course yield that m ≤ n + 1.)
What if we allow at most two distances? An example that gives \binom{n}{2} points is the set of vectors e_i + e_j with i < j, where e_1, ..., e_n are the standard basis vectors of R^n. It is not known whether this is best possible, but a comparable upper bound can be obtained with a very nice linear algebra argument.
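Before proving the upper bound, here is a quick sanity check of this example in Python (an illustrative sketch, with n = 8 an arbitrary choice):

```python
from itertools import combinations

n = 8
# the points e_i + e_j, encoded as 01-vectors with exactly two 1s
points = [tuple(1 if k in pair else 0 for k in range(n))
          for pair in combinations(range(n), 2)]
sq_dists = {sum((x - y) ** 2 for x, y in zip(u, v))
            for u, v in combinations(points, 2)}
print(sq_dists)                             # {2, 4}: exactly two distances
print(len(points), (n + 1) * (n + 4) // 2)  # 28 points against the bound 54
```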
Theorem 10.1. Let a_1, ..., a_m be points in R^n such that the number of distinct distances d(a_i, a_j) with i ≠ j is at most 2. Then m ≤ (n + 1)(n + 4)/2.
Proof. Let the two distances be a and b, and define a polynomial in 2n variables by

f(x, y) = (d(x, y)^2 − a^2)(d(x, y)^2 − b^2).

Note that d(x, y)^2 = \sum_{i=1}^{n} (x_i − y_i)^2, so this is a polynomial of degree 4.
For each i, let f_i be the n-variable polynomial given by f_i(x) = f(x, a_i). Then these polynomials are linearly independent, since f_i(a_i) = a^2b^2 ≠ 0, while f_i(a_j) = 0 for every j ≠ i.
To squeeze as much information as we can out of this observation, we would now like to prove that the polynomials f_i all lie in a low-dimensional subspace of the space of n-variable polynomials. Since they are all quartics, we can obtain an upper bound of order n^4 with ease, but with slightly more thought we can do substantially better. Note that

d(x, a_i)^2 = \sum_j x_j^2 − 2\sum_j x_j a_{ij} + \sum_j a_{ij}^2,

where we write a_{ij} for the jth coordinate of a_i. It follows that f_i(x) is a linear combination of (\sum_j x_j^2)^2, the n polynomials x_h \sum_j x_j^2, the n(n + 1)/2 polynomials x_h x_j with h ≤ j, the n polynomials x_j, and 1. So the f_i live in a space of dimension 1 + n + n(n + 1)/2 + n + 1 = (n + 1)(n + 4)/2.
The above bound is not the best known: using a strengthening of the same method,
Blokhuis obtained an upper bound of (n + 1)(n + 2)/2. But it is still an open problem to
determine the exact bound.
10.2 Set systems with intersections of restricted parity
Let A be a family of subsets of [n] such that every set in A has even cardinality and any two distinct sets in A have an intersection of even cardinality. How large can A be?
A simple example shows that A can be of size 2^{⌊n/2⌋}: we just take all sets that are unions of some of the sets {1, 2}, {3, 4}, .... It also seems to be hard to do better than this.
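Here is a quick check of this construction (an illustrative sketch with n = 8):

```python
from itertools import combinations

n = 8
blocks = [frozenset({2 * i + 1, 2 * i + 2}) for i in range(n // 2)]
# all unions of a subcollection of the blocks {1,2}, {3,4}, ...
family = [frozenset().union(*chosen)
          for k in range(len(blocks) + 1)
          for chosen in combinations(blocks, k)]
assert len(family) == 2 ** (n // 2)
assert all(len(A) % 2 == 0 for A in family)
assert all(len(A & B) % 2 == 0 for A in family for B in family)
print(len(family), "sets, all even, with all intersections even")
```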
How about if we ask for the intersections to have odd size? Now it seems to be much
more challenging to find a large example. If n is odd, the system {1, 2}, {1, 3}, . . . , {1, n},
{2, 3, . . . , n} has size n and satisfies the conditions, but it cannot be extended, since any
set of even size that contains 1 will intersect at least one {1, i} in a set of size 2, and any
set of even size that does not contain 1 intersects {2, 3, . . . , n} in a set of even size.
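And a similar check of the odd-intersection example (again an illustrative sketch, with n = 9):

```python
n = 9  # needs to be odd
family = [frozenset({1, i}) for i in range(2, n + 1)]
family.append(frozenset(range(2, n + 1)))
assert all(len(A) % 2 == 0 for A in family)
assert all(len(A & B) % 2 == 1
           for A in family for B in family if A != B)
print(len(family))  # 9 = n sets, matching the bound proved below
```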
Rather surprisingly (if you haven’t seen it before), these radically different bounds are
both sharp, and one can prove it quite simply using linear algebra. Let us start with the
second.
Theorem 10.2. Let A be a family of subsets of [n] of even size such that any two distinct
sets in the family have an intersection of odd size. Then |A| ≤ n if n is odd, and |A| ≤ n−1
if n is even.
Proof. We begin by proving the same statement but with “even” and “odd” reversed: that
is, we assume that the sets have odd size and that their intersections have even size.
Associate with each set A ∈ A its characteristic vector χ_A and regard that vector as an element of F_2^n. We can define an F_2-valued inner product in the obvious way, namely ⟨f, g⟩ = \sum_x f(x)g(x), where the sum is in F_2. Note that with respect to this inner product, we have that ⟨χ_A, χ_A⟩ = 1 and ⟨χ_A, χ_B⟩ = 0 for any two distinct A, B ∈ A. In other words, the characteristic vectors form an “orthonormal system” over F_2.
The usual proof that orthonormal vectors are independent works here too: if \sum_A λ_A χ_A = 0, then for every B, 0 = ⟨\sum_A λ_A χ_A, χ_B⟩ = λ_B. Therefore, |A| ≤ n.
Note that the sets {1}, {2}, . . . , {n} show that this bound is best possible.
Now let us assume that the sets have even size and odd intersections.
If n is odd, we can replace each A by [n] \ A, which yields a family of sets of odd size
with even intersections. (This last assertion follows easily from the fact that if A and B are
sets with cardinalities of the same parity, then |A \ B| and |B \ A| have the same parity.)
Thus |A| ≤ n.
As noted above, the sets {1, 2}, . . . , {1, n}, {2, 3, . . . , n} show that this bound is best
possible.
If n is even, then replace each A ∈ A that contains n by [n] \ A and leave the remaining sets as they are. If A and B are any two sets and |A| is even, then |A ∩ B| and |A ∩ B^c| have the same parity, and since n is even, |B| and |B^c| have the same parity as well. Also, the family cannot contain a complementary pair, since all intersections have odd size. Therefore, we obtain a new set system with the same property and the same number of sets, but now with all sets contained in [n − 1]. Since n − 1 is odd, we find that |A| ≤ n − 1.
The sets {1, 2}, . . . , {1, n} show that this bound is best possible.
Now let us consider what happens if the sets and their intersections have even sizes.
Theorem 10.3. Let A be a collection of subsets of [n] and suppose that |A| and |A ∩ B| are even for every A, B ∈ A. Then |A| ≤ 2^{⌊n/2⌋}.
Proof. Bounding A is equivalent to bounding the size of a subset F ⊂ F_2^n with the property that ⟨f, g⟩ = 0 for every f, g ∈ F.
Let k be maximal such that F contains a linearly independent set of size k. Then F lies in the orthogonal complement of that set, which has size 2^{n−k}, so |F| ≤ 2^{n−k}. But F also lies in the linear span of that set, so |F| ≤ 2^k. It follows that |F| ≤ max_k min{2^k, 2^{n−k}} = 2^{⌊n/2⌋}.
10.3 More general restricted intersections
Earlier in the course, we looked at set systems where the sizes of intersections were required
to be large. Here we shall look at set systems where we ask for intersections not to be
of certain sizes, and we shall see that even ruling out one quite large size can force a set
system to have cardinality exponentially smaller than 2^n. The next result is just a sample
of what can be done.
Theorem 10.4. Let p be a prime and let A be a family of subsets of [n] such that the size of every A ∈ A is a multiple of p, but the size of no intersection of two distinct sets in A is a multiple of p. Then

|A| ≤ \binom{n}{0} + \binom{n}{1} + · · · + \binom{n}{p−1}.
Proof. Given a set A ∈ A, define a polynomial P_A over F_p in n variables by

P_A(x) = 1 − (\sum_{i ∈ A} x_i)^{p−1}.

Note that P_A(x) = 1 if \sum_{i ∈ A} x_i = 0 and P_A(x) = 0 otherwise (by Fermat's little theorem). Note also that P_A has degree p − 1.
If x is the characteristic function of a set B ∈ A, then \sum_{i ∈ A} x_i = |A ∩ B| mod p, so P_A(x) = 1 if A = B and 0 otherwise, since in the second case our hypothesis implies that |A ∩ B| is not zero mod p. This proves that the polynomials P_A are linearly independent.
However, this is not enough to give us the bound stated. For that, we need a further trick, which is to observe that if P is any polynomial in n variables and we replace each occurrence of x_i^r by x_i (so for example the monomial x_1^3 x_4^6 x_5^{11} would become the monomial x_1 x_4 x_5), then we obtain a new polynomial that takes the same value as P on any x that has all its coordinates equal to 0 or 1, since for such an x we have that x_i^r = x_i for every r ≥ 1.
So now let us replace each P_A by a polynomial Q_A that agrees with P_A on all characteristic functions of sets and has degree at most 1 in each variable x_i. Then Q_A(χ_B) = P_A(χ_B) = δ_{AB} for every A, B ∈ A, so the Q_A are also linearly independent.
The number of monomials of degree at most p − 1 and of degree at most 1 in each variable x_i is

\binom{n}{0} + \binom{n}{1} + · · · + \binom{n}{p−1},

so the Q_A live in a space of this dimension, which proves the result.
Note that in the case p = 2 this gives a bound of n + 1, which is pretty close to the
exact bound we obtained in the previous subsection.
An interesting special case of the above result is where n = 4p and the sets all have size 2p. Then the result tells us that if A is a collection of subsets of [4p] of size 2p, no two of which are disjoint and no two of which have an intersection of size exactly p, then |A| ≤ \sum_{m<p} \binom{4p}{m}, which is exponentially smaller than 2^{4p}.
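The size of this gap is easy to see numerically (an illustrative sketch):

```python
from math import comb

# the ratio of the Theorem 10.4 bound to 2^(4p) decays exponentially in p
for p in [5, 11, 17, 23, 29]:
    n = 4 * p
    print(p, sum(comb(n, m) for m in range(p)) / 2 ** n)
```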
From this, we can obtain the following geometrical consequence.
Corollary 10.5. Let p be an odd prime and let n = 4p. Then the largest measure of a set X of unit vectors in R^{4p} that contains no pair of orthogonal vectors is exponentially small.
Proof. Let Y be the collection of unit vectors all of whose coordinates are either 1/(2√p) or −1/(2√p) and that sum to zero. If A ⊂ [4p] is a set of size 2p, let v_A be the vector in Y that takes the value 1/(2√p) on A and −1/(2√p) on [4p] \ A. A short calculation shows that ⟨v_A, v_B⟩ = (|A ∩ B| − p)/p, so ⟨v_A, v_B⟩ = 0 if and only if |A ∩ B| = p. Thus, the sets A corresponding to the vectors of X ∩ Y have no pairwise intersections of size exactly p. They may include complementary pairs (note that v_{[4p] \ A} = −v_A), but two distinct 2p-subsets of [4p] are disjoint only if they are complementary, so if we discard one set from each complementary pair, which at most halves the family, then Theorem 10.4 applies and gives |X ∩ Y| ≤ 2\sum_{m=0}^{p−1} \binom{n}{m}, which is exponentially smaller than 2^n since p = n/4.
Since the property of not containing a pair of orthogonal vectors is invariant under rotation, it also follows that if ρ is any rotation of R^n, then the intersection ρX ∩ Y also has size at most 2\sum_{m=0}^{p−1} \binom{n}{m}.
Now we do an averaging argument very similar to the averaging step in Katona's proof of the Erdős-Ko-Rado theorem. Let ρ be a random rotation of R^n. Then for each y ∈ Y, the probability that y ∈ ρX is μ(X) (where μ is the rotation-invariant probability measure on the sphere), so the expected size of ρX ∩ Y is μ(X)|Y|. But it is also at most 2\sum_{m=0}^{p−1} \binom{n}{m}, from which we deduce that μ(X) ≤ 2\binom{n}{2p}^{−1} \sum_{m=0}^{p−1} \binom{n}{m}. The estimates in Section 2 imply easily that this is exponentially small as a function of n.
To justify the last sentence of the above proof slightly more, \binom{n}{2p} = \binom{n}{n/2}, which is n^{−1/2}2^n up to a constant factor, and we also know that if α < 1/2, then the sum \sum_{m ≤ αn} \binom{n}{m} is exponentially smaller than 2^n, by Lemma 2.6.
10.4 Kahn and Kalai’s disproof of Borsuk’s conjecture
The following conjecture was made by Borsuk in 1933.
Conjecture 10.6. Let X be a bounded subset of R^n. Then there exist sets Y_1, ..., Y_{n+1} of diameter less than that of X such that X ⊂ \bigcup_i Y_i.
To see why this is a reasonable conjecture, consider the case where X is the unit ball of ℓ_2^n – that is, the set of all vectors of norm at most 1. As we saw in the section on using the Borsuk-Ulam theorem, if we cover the boundary of X with n closed sets, then one of those sets will contain an antipodal pair, and therefore will have diameter at least that of X. Therefore, the bound n + 1, if correct, is best possible. (On the third examples sheet I ask you to show that Borsuk's conjecture is true for this X.)
It is not hard to see that we may assume that X and Y_1, ..., Y_{n+1} are closed and convex,
since taking the closure of the convex hull of a set does not change its diameter. The
conjecture came to seem even more reasonable when it was proved to be true when X is
smooth. (I won’t say here precisely what this means, but roughly speaking it means that
the boundary of X has no sharp angles.)
It was therefore a huge surprise (yet another) when Jeff Kahn and Gil Kalai not only disproved it, but obtained an exponentially large (in √n) lower bound for the number of sets of smaller diameter needed to cover X, and even more of a surprise (yet another) that
the proof was very short and easy to understand. I was a graduate student at the time,
and was lucky enough to see Jeff Kahn presenting the proof in Cambridge just after it had
been announced in 1993.
I’ll give an argument that’s not quite identical to theirs, but it’s very similar. (It will give a very slightly worse bound, but avoids the need for a tiny technical improvement to the argument in the previous section.) The next result is another corollary of Theorem 10.4.
Corollary 10.7. Let n = 4p for a prime p. Then R^{n^2} contains a set X of size \binom{n−1}{2p−1} such that every subset of X of smaller diameter has size at most \sum_{m=0}^{p−1} \binom{n}{m}.
Proof. Let Y be the set of unit vectors defined in the proof of Corollary 10.5 and, again as in that proof, for each A ∈ [4p]^{(2p)} let v_A ∈ Y be the vector with all its positive coordinates in A. As noted in that proof, v_A and v_B are orthogonal if and only if |A ∩ B| = p.
Now let X ⊂ R^{n^2} be the set of all points v ⊗ v such that v ∈ Y. (We saw this notation in the section on the cap-set problem: v ⊗ v denotes the matrix with ijth entry v_iv_j, which we can think of as a point in R^{n^2}.) Since v_{[4p] \ A} = −v_A and v ⊗ v = (−v) ⊗ (−v), each point of X arises from exactly one set A that contains the element 1, which gives |X| = \binom{n−1}{2p−1}. For any v, w ∈ R^n,

⟨v ⊗ v, w ⊗ w⟩ = \sum_{ij} v_iv_jw_iw_j = ⟨v, w⟩^2 ≥ 0,
with equality if and only if v and w are orthogonal. This calculation also shows that if v and w are unit vectors, then so are v ⊗ v and w ⊗ w, which implies that

‖v ⊗ v − w ⊗ w‖^2 = ‖v ⊗ v‖^2 + ‖w ⊗ w‖^2 − 2⟨v ⊗ v, w ⊗ w⟩ = 2 − 2⟨v, w⟩^2 ≤ 2,

with equality if and only if ⟨v, w⟩ = 0. It follows that X has diameter √2, and that no subset of X of smaller diameter contains a pair of points v_A ⊗ v_A and v_B ⊗ v_B such that v_A and v_B are orthogonal, or equivalently such that |A ∩ B| = p. Representing each point of X by the unique corresponding set A that contains 1, a subset of X of smaller diameter therefore gives a family of 2p-subsets of [4p], no two of which are disjoint (they all contain 1) and no two of which have an intersection of size exactly p. The result now follows from Theorem 10.4.
It follows immediately that X cannot be covered by fewer than \binom{n−1}{2p−1} (\sum_{m=0}^{p−1} \binom{n}{m})^{−1} subsets of smaller diameter. This is exponentially large in n, and so exponentially large in the square root of the dimension of the space that contains X.
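It is amusing to work out how large p has to be before this version of the bound genuinely contradicts Borsuk's conjecture, which predicts that n^2 + 1 pieces suffice in dimension n^2. The illustrative sketch below (the integer division loses a little, which does not matter here) suggests that p = 17, so dimension 68^2 = 4624, is the first prime for which it does; Kahn and Kalai's own more careful argument gives counterexamples in considerably lower dimensions.

```python
from math import comb

def pieces_needed(p):
    # lower bound |X| / (max size of a piece of smaller diameter), n = 4p
    n = 4 * p
    return comb(n - 1, 2 * p - 1) // sum(comb(n, m) for m in range(p))

for p in [2, 3, 5, 7, 11, 13, 17, 19]:
    n = 4 * p
    print(p, n * n + 1, pieces_needed(p), pieces_needed(p) > n * n + 1)
```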