Functional Equations
Tom Leinster∗
Spring 2017
Contents
1 Warm-up
2 Shannon entropy
3 Deformed entropies
4 Probabilistic methods
Preamble
Hello.
Admin: email addresses; sections in outline ≠ lectures; pace.
Overall plan: interested in unique characterizations of . . .
• elementary real analysis
• (new!) some probabilistic methods.
One ref: Aczél and Daróczy, On Measures of Information and Their Characterizations. (Comments.) Other refs: will give as we go along.
1 Warm-up
Week I (7 Feb)
Which functions f satisfy f (x + y) = f (x) + f (y)? Which functions of two
variables can be separated as a product of functions of one variable?
This section is an intro to basic techniques. We may or may not need the
actual results we prove.
There are some obvious solutions. Are they the only ones? Weak result first,
to illustrate technique.
Proposition 1.1 Let f : R → R be a differentiable function. TFAE (the following are equivalent):
i. f satisfies (1), the Cauchy functional equation f(x + y) = f(x) + f(y) for all x, y ∈ R;
ii. there exists c ∈ R such that f(x) = cx for all x ∈ R.
Proof (sketch) For (i)⇒(ii): differentiating (1) with respect to x gives
∀x, y ∈ R, f′(x + y) = f′(x),
so f′ is constant, say f′ ≡ c; since (1) forces f(0) = 0, we get f(x) = cx.
In fact, differentiability can be relaxed to continuity: if f : R → R is continuous and satisfies (1), then f(x) = f(1)x for all x. Proof:
• f(x) + f(−x) = f(x + (−x)) = f(0) = 0, so f(−x) = −f(x). Cf. group homomorphisms.
• Next, f (nx) = nf (x) for all x ∈ R and n ∈ Z. For n > 0, true by
induction. For n = 0, says f (0) = 0. For n < 0, have −n > 0 and so
f (nx) = −f (−nx) = −(−nf (x)) = nf (x).
• In particular, f (n) = nf (1) for all n ∈ Z.
• For m, n ∈ Z with n ≠ 0, we have f(n · (m/n)) = f(m) = mf(1), but also
f(n · (m/n)) = nf(m/n),
so f(m/n) = (m/n)f(1). Hence f(x) = f(1)x for all x ∈ Q.
• Now f and x 7→ f (1)x are continuous functions on R agreeing on Q, hence
are equal.
Corollary 1.4 i. The continuous functions f : R → (0, ∞) satisfying f(x + y) = f(x)f(y) for all x, y are exactly the functions f(x) = e^{cx} (c ∈ R); ii. the continuous functions f : (0, ∞) → R satisfying f(xy) = f(x) + f(y) are exactly f(x) = c log x; iii. the continuous functions f : (0, ∞) → (0, ∞) satisfying f(xy) = f(x)f(y) are exactly f(x) = x^c.
Proof For (i), define g : R → R by g(x) = log f(x). Then g is continuous and satisfies the Cauchy FE, so g(x) = cx for some constant c, and then f(x) = e^{cx}.
(ii) and (iii): similarly, putting g(x) = f(e^x) and g(x) = log f(e^x) respectively.
Related:
Theorem 1.5 (Erdős?) Let f : Z⁺ → (0, ∞) be a function satisfying f(mn) = f(m)f(n) for all m, n ∈ Z⁺. (There are loads of solutions: can freely choose f(p) for every prime p. But . . . ) Suppose that either f(1) ≤ f(2) ≤ · · · or
lim_{n→∞} f(n + 1)/f(n) = 1.
Then f(n) = n^c for all n, for some constant c ∈ R.
Proof Omitted.
Separation of variables
Note added later: we don’t actually use this part anywhere later in the course.
When can a function of two variables be written as a product/sum of two
functions of one variable? We’ll do sums, but can convert to products as in
Corollary 1.4.
Let X and Y be sets and
f: X ×Y →R
g : X → R, h: Y → R
such that
∀x ∈ X, y ∈ Y, f (x, y) = g(x) + h(y). (2)
Basic questions:
A Are there any pairs of functions (g, h) satisfying (2)?
B How can we construct all such pairs?
C How many such pairs are there? Clear that if there are any, there are many,
by adding/subtracting constants.
I got up to here in the first class, and was going to lecture the rest of this
section in the second class, but in the end decided not to. What I actually
lectured resumes at the start of Section 2. But for completeness, here’s the rest
of this section.
First, note that if (2) holds then f(x, y) − f(x′, y) = g(x) − g(x′)  (x, x′ ∈ X, y ∈ Y). No h's involved!
First lemma: g and h are determined by f , up to additive constant.
Lemma 1.6 Let g : X → R and h : Y → R be functions. Define f : X × Y → R by (2). Let x₀ ∈ X and y₀ ∈ Y.
Then there exist c, d ∈ R such that c + d = f(x₀, y₀) and
g(x) = f(x, y₀) − c   ∀x ∈ X,    (3)
h(y) = f(x₀, y) − d   ∀y ∈ Y.    (4)
Proof Take c = h(y₀) and d = g(x₀); then c + d = f(x₀, y₀) and (3), (4) hold, by (2).
But given f (and x0 and y0 ), is every pair (g, h) of this form a solution
of (2)? Not necessarily (but it’s easy to say when). . .
Lemma 1.7 Let f : X × Y → R be a function. Let x0 ∈ X, y0 ∈ Y , and
c, d ∈ R with c + d = f (x0 , y0 ). Define g : X → R by (3) and h : Y → R by (4).
If
f (x, y0 ) + f (x0 , y) = f (x, y) + f (x0 , y0 ) ∀x ∈ X, y ∈ Y
then
f (x, y) = g(x) + h(y) ∀x ∈ X, y ∈ Y.
etc.
Proposition 1.9 Let f : X × Y → R be a function satisfying the equivalent
conditions of Proposition 1.8, and let x0 ∈ X and y0 ∈ Y . Then a pair of
functions (g : X → R, h : Y → R) satisfies (2) if and only if there exist c, d ∈ R
satisfying c + d = f (x0 , y0 ), (3) and (4).
Given one such pair (g, h), the set of all of them is {(g + a, h − a) : a ∈ R}.
2 Shannon entropy
Week II (14 Feb)
Recap, including Erdős theorem. No separation of variables!
The many meanings of the word entropy. Ordinary entropy, relative entropy,
conditional entropy, joint entropy, cross entropy; entropy on finite and infinite
spaces; quantum versions; entropy in topological dynamics; . . . . Today we stick
to the very simplest kind: Shannon entropy of a probability distribution on a
finite set.
Let p = (p_1, . . . , p_n) be a probability distribution on {1, . . . , n} (i.e. p_i ≥ 0, ∑_i p_i = 1). The (Shannon) entropy of p is
H(p) = − ∑_{i : p_i > 0} p_i log p_i = ∑_{i : p_i > 0} p_i log(1/p_i).
The sum is over all i ∈ {1, . . . , n} such that p_i ≠ 0; equivalently, can sum over all i ∈ {1, . . . , n} but with the convention that 0 log 0 = 0.
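As a computational aside, here is a minimal Python sketch of this definition, taking logs base 2 so that entropy is measured in bits; the function name is just illustrative.

```python
from math import log2

def shannon_entropy(p):
    """Shannon entropy of p = (p_1, ..., p_n), in bits (log base 2).

    Terms with p_i = 0 are omitted, i.e. the convention 0 log 0 = 0.
    """
    return -sum(p_i * log2(p_i) for p_i in p if p_i > 0)

print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 (the worked example below)
print(shannon_entropy([0.0, 0.0, 1.0, 0.0]))       # 0.0: the least uniform distribution
print(shannon_entropy([0.25] * 4))                 # 2.0 = log2(4): the most uniform one
```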
Ways of thinking about entropy:
• Disorder.
• Uniformity. Will see that uniform distribution has greatest entropy among
all distributions on {1, . . . , n}.
• Expected surprise. Think of log(1/pi ) as your surprise at learning that an
event of probability pi has occurred. The smaller pi is, the more surprised
you are. Then H(p) is the expected value of the surprise: how surprised
you expect to be!
• Information. Similar to expected surprise. Think of log(1/p_i) as the information that you gain by observing an event of probability p_i. The smaller
pi is, the rarer the event is, so the more remarkable it is. Then H(p) is
the average amount of information per event.
• Genericity. In context of thermodynamics, entropy measures how generic
a state a system is in. Closely related to ‘lack of information’.
First properties:
• H(p) ≥ 0 for all p, with equality iff p = (0, . . . , 0, 1, 0, . . . , 0). Least
uniform distribution.
• H(p) ≤ log n for all p, with equality iff p = (1/n, . . . , 1/n). Most uniform distribution. Proof that H(p) ≤ log n uses concavity of log:
H(p) = ∑_{i : p_i > 0} p_i log(1/p_i) ≤ log ( ∑_{i : p_i > 0} p_i · (1/p_i) ) ≤ log n.
H(1/2, 1/4, 1/8, 1/8) = (1/2) log₂ 2 + (1/4) log₂ 4 + (1/8) log₂ 8 + (1/8) log₂ 8
= (1/2) · 1 + (1/4) · 2 + (1/8) · 3 + (1/8) · 3
= 1¾.
iii. That example was special in that all the probabilities were integer powers
of 2. But. . . Can still make sense of this when probabilities aren’t
powers of 2 (Shannon’s first theorem). E.g. frequency distribution p =
(p1 , . . . , p26 ) of letters in English has H(p) ≈ 4, so can encode English in
about 4 bits/letter. So, it’s as if English had only 16 letters, used equally
often.
Will now explain a more subtle property of entropy. Begin with example.
Example 2.3 Flip a coin. If it’s heads, roll a die. If it’s tails, draw from a pack
of cards. So final outcome is either a number between 1 and 6 or a card. There
are 6 + 52 = 58 possible final outcomes, with probabilities as shown (assuming
everything unbiased):
[Tree diagram: COIN at the root; with probability 1/2 each, the outcome is a DIE roll or a CARD draw.]
How much information do you expect to get from observing the outcome?
• You know result of coin flip, giving H(1/2, 1/2) = 1 bit of info.
• With probability 1/2, you know result of die roll: H(1/6, . . . , 1/6) = log2 6
bits of info.
• With probability 1/2, you know result of card draw: H(1/52, . . . , 1/52) =
log2 52 bits.
In total:
1 + (1/2) log₂ 6 + (1/2) log₂ 52
bits of info. This suggests
H(1/12, . . . , 1/12, 1/104, . . . , 1/104) = 1 + (1/2) log₂ 6 + (1/2) log₂ 52,
where there are 6 entries equal to 1/12 and 52 entries equal to 1/104.
In general, given w ∈ ∆_n, p^1 ∈ ∆_{k_1}, . . . , p^n ∈ ∆_{k_n},
get composite distribution
w ◦ (p^1, . . . , p^n) = (w_1 p^1_1, . . . , w_1 p^1_{k_1}, . . . , w_n p^n_1, . . . , w_n p^n_{k_n}) ∈ ∆_{k_1 + ··· + k_n}
[Tree diagram: the distribution w at the root, with the distribution p^i hanging beneath the ith branch.]
The chain rule for entropy states that H(w ◦ (p^1, . . . , p^n)) = H(w) + ∑_{i=1}^n w_i H(p^i). In particular, w ◦ (p, . . . , p) = w ⊗ p is the joint distribution of two independent random variables distributed according to w and p, so the chain rule implies multiplicativity:
H(w ⊗ p) = H(w) + H(p).
Interpretation: the information from two independent observations is the sum of the information from each.
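A quick numerical check of the chain rule and of multiplicativity, as a sketch (the helper function names are illustrative):

```python
import random
from math import log2

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

def compose(w, ps):
    """Composite distribution w o (p^1, ..., p^n): scale each p^i by w_i, then concatenate."""
    return [w_i * x for w_i, p in zip(w, ps) for x in p]

def random_dist(k):
    xs = [random.random() for _ in range(k)]
    total = sum(xs)
    return [x / total for x in xs]

w = random_dist(3)
ps = [random_dist(k) for k in (2, 4, 5)]
lhs = entropy(compose(w, ps))
rhs = entropy(w) + sum(w_i * entropy(p) for w_i, p in zip(w, ps))
print(abs(lhs - rhs) < 1e-9)      # True: the chain rule

p, r = random_dist(4), random_dist(3)
tensor = compose(p, [r] * 4)      # p ⊗ r = p o (r, ..., r)
print(abs(entropy(tensor) - entropy(p) - entropy(r)) < 1e-9)  # True: multiplicativity
```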
Where are the functional equations?
For each n ≥ 1, have function H : ∆_n → R₊ = [0, ∞). Faddeev² showed:
Theorem 2.4 (Faddeev, 1956) Take functions (I : ∆_n → R₊)_{n≥1}. TFAE:
i. the functions I are continuous and satisfy the chain rule;
ii. I = cH for some c ∈ R+ .
That is: up to a constant factor, Shannon entropy is uniquely characterized
by continuity and chain rule.
Should we be disappointed to get scalar multiples of H, not H itself? No:
recall that different scalar multiples correspond to different choices of the base
for log.
Rest of this section: proof of Faddeev’s theorem.
Certainly (ii)⇒(i). Now take I satisfying (i).
Write u_n = (1/n, . . . , 1/n) ∈ ∆_n. Strategy: think about the sequence (I(u_n))_{n≥1}. It should be (c log n)_{n≥1} for some constant c.
¹This is completely straightforward, but can be made even more transparent by first observing that the function f(x) = −x log x is a 'nonlinear derivation', i.e. f(xy) = xf(y) + f(x)y. In fact, −x log x is the only measurable function F with this property (up to a constant factor), since if we put g(x) = F(x)/x then g(xy) = g(y) + g(x) and so g(x) ∝ log x.
²Dmitry Faddeev, father of the physicist Ludvig Faddeev.
Lemma 2.5 i. I(umn ) = I(um ) + I(un ) for all m, n ≥ 1.
ii. I(u1 ) = 0.
Theorem 1.5 (Erdős) would now tell us that I(un ) = c log n for some constant
c (putting f (n) = exp(I(un ))). But to conclude that, we need one of the two
alternative hypotheses of Theorem 1.5 to be satisfied. We prove the second one,
on limits. This takes some effort.
Lemma 2.6 I(1, 0) = 0.
Second:
Lemma 2.7 I(u_{n+1}) − (n/(n+1)) I(u_n) → 0 as n → ∞.
Proof We have
u_{n+1} = (n/(n+1), 1/(n+1)) ◦ (u_n, u_1),
so by the chain rule and I(u_1) = 0,
I(u_{n+1}) = I(n/(n+1), 1/(n+1)) + (n/(n+1)) I(u_n).
So
I(u_{n+1}) − (n/(n+1)) I(u_n) = I(n/(n+1), 1/(n+1)) → I(1, 0) = 0
as n → ∞, by continuity and Lemma 2.6.
To improve this to I(un+1 )−I(un ) → 0, use a general result that has nothing
to do with entropy:
Lemma 2.8 Let (a_n)_{n≥1} be a sequence in R such that a_{n+1} − (n/(n+1)) a_n → 0 as n → ∞. Then a_{n+1} − a_n → 0 as n → ∞.
It is enough to prove that a_n/(n + 1) → 0 as n → ∞. Write b_1 = a_1 and b_n = a_n − ((n−1)/n) a_{n−1} for n ≥ 2. Then n a_n = n b_n + (n − 1) a_{n−1} for all n ≥ 2, so . . .
Theorem 2.4 follows by continuity.
[Diagram: a two-level tree. The top level is the frequency distribution of the letters a, b, c, . . . , z; beneath each letter is the distribution of accents on that letter (e.g. a, à, â for 'a'), with entropies such as 1.5, 0, 1, 0 for the various letters.]
We need
3.8 + 0.05 × 1.5 + 0.02 × 0 + 0.03 × 1 + · · · + 0.004 × 0
bits per symbol: the 3.8 is for the actual letters, and the remaining terms are for the accents.
Relative entropy
Let p, r ∈ ∆_n. The entropy of p relative to r is
H(p ∥ r) = ∑_{i : p_i > 0} p_i log(p_i/r_i).
So relative entropy is the number of extra bits needed if you use the wrong machine. Or: the penalty you pay for using the wrong machine. This explains why H(p ∥ r) ≥ 0, with equality iff p = r.
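A matching Python sketch for relative entropy (with an illustrative function name); it returns infinity when some p_i > 0 has r_i = 0, which is the case discussed next.

```python
from math import log2, inf

def relative_entropy(p, r):
    """H(p || r) in bits: sum of p_i * log2(p_i / r_i) over i with p_i > 0."""
    total = 0.0
    for p_i, r_i in zip(p, r):
        if p_i > 0:
            if r_i == 0:
                return inf
            total += p_i * log2(p_i / r_i)
    return total

print(relative_entropy([0.5, 0.5], [0.5, 0.5]))   # 0.0 (equality iff p = r)
print(relative_entropy([0.9, 0.1], [0.5, 0.5]))   # ≈ 0.531 > 0
print(relative_entropy([0.5, 0.5], [1.0, 0.0]))   # inf
```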
If r_i = 0 then in machine r, the ith symbol has an infinitely long code word. Or if you like: if r_i = 2^{−1000} then its code word has length 1000. So if also p_i > 0 then for language p encoded using machine r, the average number of bits per symbol is ∞. This explains why H(p ∥ r) = ∞.
Taking r = u_n gives H(p ∥ u_n) = log n − H(p).
Relative entropy also makes sense for probability measures µ, ν (with µ absolutely continuous with respect to ν): H(µ ∥ ν) = ∫ log(dµ/dν) dµ, where dµ/dν is the Radon–Nikodym derivative. This makes sense and is the right definition.
People do talk about the entropy of probability distributions on Rn . For
instance, the entropy of a probability density function f on R is usually defined as H(f) = −∫_R f(x) log f(x) dx, and it's an important result that among all
density functions on R with a given mean and variance, the one with the maximal entropy is the normal distribution. (This is related to the central limit
theorem.) But here we’re implicitly using Lebesgue measure λ on R; so there
are two measures in play, λ and f λ, and H(f ) = H(f λ k λ).
Locally, H(− ∥ −) is like a squared distance.
In particular, locally (to second order) it's symmetric.
The square root of relative entropy is not a metric on ∆_n: it is not symmetric and it fails the triangle inequality. (E.g. put p = (0.9, 0.1), q = (0.2, 0.8), r = (0.1, 0.9). Then √H(p ∥ q) + √H(q ∥ r) < √H(p ∥ r).) But using it as a 'local distance'
leads to important things, e.g. Fisher information (statistics), the Jeffreys prior
(Bayesian statistics), and the whole subject of information geometry.
For each n, relative entropy gives a function H(− ∥ −) : A_n → R₊, where A_n = {(p, r) ∈ ∆_n × ∆_n : r_i = 0 ⇒ p_i = 0}.
Properties:
• Chain rule: whenever
(w, w̃) ∈ A_n, (p^1, p̃^1) ∈ A_{k_1}, . . . , (p^n, p̃^n) ∈ A_{k_n},
we have
H(w ◦ (p^1, . . . , p^n) ∥ w̃ ◦ (p̃^1, . . . , p̃^n)) = H(w ∥ w̃) + ∑_{i=1}^n w_i H(p^i ∥ p̃^i).
Consider Swiss French and Canadian French:
w ∈ ∆_26: frequencies of letters in Swiss,
w̃ ∈ ∆_26: frequencies of letters in Canadian,
and then
p^1 ∈ ∆_3: frequencies of accents on 'a' in Swiss,
p̃^1 ∈ ∆_3: frequencies of accents on 'a' in Canadian,
. . .
p^26 ∈ ∆_1: frequencies of accents on 'z' in Swiss,
p̃^26 ∈ ∆_1: frequencies of accents on 'z' in Canadian.
So
w ◦ (p^1, . . . , p^26) = frequency distribution of all symbols in Swiss,
w̃ ◦ (p̃^1, . . . , p̃^26) = frequency distribution of all symbols in Canadian,
where ‘symbol’ means a letter with (or without) an accent.
Now encode Swiss using Canadian machine. How much extra does it cost
(in mean bits/symbol) compared to encoding Swiss using Swiss machine?
Proof Case k = n trivial; suppose k < n.
Since p is a probability distribution with pi = 0 for all i > k, there is some
i ≤ k such that pi > 0, and then ri > 0 since (p, r) ∈ An . Hence r1 +· · ·+rk > 0.
By definition of operadic composition,
I(p ∥ r) = I((1, 0) ◦ (p′, r″) ∥ (r_1 + · · · + r_k, r_{k+1} + · · · + r_n) ◦ (r′, r″)),
and the result follows. In order to use the chain rule, we needed to know that various pairs were in A_2, A_k, etc.; that's easily checked.
Proof Consider
Result follows.
Lemma 2.15 There is a unique constant c ∈ R+ such that L(α) = −c log α for
all α ∈ (0, 1].
Proof Follows from Lemma 2.14 and measurability, as in Cor 1.4. That corollary was stated under the hypothesis of continuity, but measurability would have been enough.
Proof We have (p, r) ∈ An , so ri > 0 for all i. So can choose α ∈ (0, 1] such
that ri − αpi ≥ 0 for all i.
We will compute the (well-defined) number
x := I((p_1, . . . , p_n, 0, . . . , 0) ∥ (αp_1, . . . , αp_n, r_1 − αp_1, . . . , r_n − αp_n)),
where the first vector ends in n zeros.
Second, by symmetry and then the chain rule,
x = I((p_1, 0, p_2, 0, . . . , p_n, 0) ∥ (αp_1, r_1 − αp_1, . . . , αp_n, r_n − αp_n))
  = I(p ◦ ((1, 0), . . . , (1, 0)) ∥ r ◦ ((α p_1/r_1, 1 − α p_1/r_1), . . . , (α p_n/r_n, 1 − α p_n/r_n)))
  = I(p ∥ r) + ∑_{i=1}^n p_i L(α p_i/r_i)
  = I(p ∥ r) − c log α − c H(p ∥ r).
Comparing the two expressions for x gives the result.
Proof of Theorem 2.12 Let (p, r) ∈ An . By symmetry, can assume
p1 > 0, . . . , pk > 0, pk+1 = 0, . . . , pn = 0
where 1 ≤ k ≤ n. Writing R = r_1 + · · · + r_k,
I(p ∥ r) = L(R) + I((p_1, . . . , p_k) ∥ (1/R)(r_1, . . . , r_k))    by Lemma 2.13
         = −c log R + c H((p_1, . . . , p_k) ∥ (1/R)(r_1, . . . , r_k))    by Lemmas 2.15 and 2.16.
This holds for all I(− ∥ −) satisfying the four conditions; in particular, it holds for cH(− ∥ −). Hence I(p ∥ r) = cH(p ∥ r).
Remark 2.17 Assuming permutation-invariance, the chain rule is equivalent
to a very special case:
Week V (7 Mar)
Recap: formulas for entropy and relative entropy (in both log(1/something)
and − log(something) forms); chain rules for both.
Remark 2.18 I need to make explicit something we’ve been using implicitly
for a while.
Let p ∈ ∆n and r ∈ ∆m . Then we get a new distribution
p ⊗ r = (p_1r_1, . . . , p_1r_m, . . . , p_nr_1, . . . , p_nr_m) = p ◦ (r, . . . , r) ∈ ∆_{nm},
with n copies of r in the composition.
Probabilistically, this is the joint distribution of independent random variables
distributed according to p and r.
A special case of the chain rules for entropy and relative entropy:
H(p ⊗ r) = H(p) + H(r),    H(p ⊗ r ∥ p̃ ⊗ r̃) = H(p ∥ p̃) + H(r ∥ r̃).
(So H and H(− ∥ −) are log-like; like a higher version of Cauchy's functional equation.)
3 Deformed entropies
Shannon entropy is just a single member of a one-parameter family of entropies.
In fact, there are two different one-parameter families of entropies, both containing Shannon entropy as a member. In some sense, these two families of
entropies are equivalent, but they have different flavours.
I’ll talk about both families: surprise entropies in this section, then Rényi
entropies when we come to measures of diversity later on.
Definition 3.1 Let q ∈ [0, ∞). The q-logarithm ln_q : (0, ∞) → R is defined by
ln_q(x) = (x^{1−q} − 1)/(1 − q)   if q ≠ 1,
ln_q(x) = ln(x)                    if q = 1.
Note that for q ≠ 1, in general ln_q(xy) ≠ ln_q(x) + ln_q(y) and ln_q(1/x) ≠ − ln_q(x).
First one inevitable: we already showed that scalar multiples of actual logarithm
are the only continuous functions that convert multiplication into addition. In
fact, there’s quite a neat formula for lnq (xy) in terms of lnq (x), lnq (y) and q:
exercise!
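A tiny numerical illustration of the q-logarithm and of the failure of additivity for q ≠ 1 (the exercise above is left alone):

```python
from math import log

def lnq(x, q):
    """q-logarithm: (x^(1-q) - 1)/(1 - q) for q != 1, and ln(x) for q = 1."""
    if q == 1:
        return log(x)
    return (x ** (1 - q) - 1) / (1 - q)

x, y, q = 2.0, 3.0, 2.0
print(lnq(x * y, q), lnq(x, q) + lnq(y, q))   # 0.8333... vs 1.1666...: not additive
print(lnq(x * y, 1), lnq(x, 1) + lnq(y, 1))   # equal: the ordinary logarithm is additive
```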
For a probability p ∈ [0, 1], can view ln_q(1/p) as one's 'surprise' at witnessing an event with probability p. (Decreasing in p; takes value 0 at p = 1.)
[Graph: ln_q(1/p) as a function of p, plotted for q = 0, 1, 2, 3.]
Theorem 3.3 Let q ∈ [0, ∞) \ {1}. Let (I : ∆_n → R₊)_{n≥1} be functions. TFAE:
i. I is permutation-invariant and q-multiplicative in the sense above;
ii. I = cS_q for some c ∈ R₊.
No regularity condition needed! And don't need full chain rule—just a special case, multiplicativity.
Following proof is extracted from Aczél and Daróczy’s book (Theorem 6.3.9),
but they make it look way more complicated. The key point is that the multiplicativity property is not symmetric.
Proof (ii)⇒(i) easy. Assume (i). By permutation-invariance,
I(p ⊗ r) = I(r ⊗ p)
hence
(1 − ∑_i r_i^q) I(p) = (1 − ∑_i p_i^q) I(r).
Now want to get the ps on one side and the rs on the other, to deduce that I(p) is proportional to 1 − ∑_i p_i^q. But need to be careful about division by zero.
Take r = u_2 = (1/2, 1/2): then
(1 − 2^{1−q}) I(p) = (1 − ∑_i p_i^q) I(u_2)
for all p. But q ≠ 1, so 1 − 2^{1−q} ≠ 0. So I = cS_q where c = I(u_2)(1 − q)/(2^{1−q} − 1).
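To see the characterized quantity concretely, here is a sketch assuming the explicit form S_q(p) = ∑_{i : p_i > 0} p_i ln_q(1/p_i) = (1 − ∑_i p_i^q)/(q − 1) for q ≠ 1 (the quantity that I(p) is shown to be proportional to in the proof), checking the identity derived there:

```python
def surprise_entropy(p, q):
    """Assumed form: S_q(p) = (1 - sum_{i: p_i > 0} p_i^q)/(q - 1), for q != 1."""
    assert q != 1
    return (1 - sum(p_i ** q for p_i in p if p_i > 0)) / (q - 1)

# The identity from the proof: (1 - sum_i r_i^q) S_q(p) = (1 - sum_i p_i^q) S_q(r).
p, r, q = [0.5, 0.3, 0.2], [0.25, 0.75], 2.0
lhs = (1 - sum(x ** q for x in r)) * surprise_entropy(p, q)
rhs = (1 - sum(x ** q for x in p)) * surprise_entropy(r, q)
print(abs(lhs - rhs) < 1e-9)   # True
```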
Relative entropy generalizes easily too, i.e. extends all along the family from
the point q = 1. Only ticklish point is lnq (1/x) vs. − lnq (x).
Definition 3.4 Let q ∈ [0, ∞). For p, r ∈ ∆_n, the relative surprise entropy of order q is
S_q(p ∥ r) = − ∑_{i : p_i > 0} p_i ln_q(r_i/p_i)
           = (1/(1 − q)) (1 − ∑_{i : p_i > 0} p_i^q r_i^{1−q})   if q ≠ 1,
           = ∑_{i : p_i > 0} p_i log(p_i/r_i)                     if q = 1.
Properties:
• Permutation-invariance (as in case q = 1).
• q-chain rule:
S_q(w ◦ (p^1, . . . , p^n) ∥ w̃ ◦ (p̃^1, . . . , p̃^n)) = S_q(w ∥ w̃) + ∑_{i : w_i > 0} w_i^q w̃_i^{1−q} S_q(p^i ∥ p̃^i).
Again, there’s a ludicrously simple characterization theorem that needs no
regularity condition. Nor does it need the vanishing condition of Theorem 2.12.
Recall notation:
An = {(p, r) ∈ ∆n × ∆n : ri = 0 =⇒ pi = 0}.
Then S_q(p ∥ r) < ∞ ⟺ (p, r) ∈ A_n.
Theorem 3.5 Let q ∈ [0, ∞) \ {1}. Let (I(− ∥ −) : A_n → R₊)_{n≥1} be functions. TFAE:
i. I(− ∥ −) is permutation-invariant and q-multiplicative in the sense above;
ii. I(− ∥ −) = cS_q(− ∥ −) for some c ∈ R₊.
Proof (ii)⇒(i) easy. Assume (i). By permutation-invariance,
I(p ⊗ r ∥ p̃ ⊗ r̃) = I(r ⊗ p ∥ r̃ ⊗ p̃)
for all (p, p̃) ∈ A_n and (r, r̃) ∈ A_m. So by q-multiplicativity,
I(p ∥ p̃) + ∑_i p_i^q p̃_i^{1−q} I(r ∥ r̃) = I(r ∥ r̃) + ∑_i r_i^q r̃_i^{1−q} I(p ∥ p̃),
hence
(1 − ∑_i r_i^q r̃_i^{1−q}) I(p ∥ p̃) = (1 − ∑_i p_i^q p̃_i^{1−q}) I(r ∥ r̃).
Take r = (1, 0) and r̃ = u_2: then
(1 − 2^{q−1}) I(p ∥ p̃) = I((1, 0) ∥ u_2) (1 − ∑_i p_i^q p̃_i^{1−q})
for all (p, p̃) ∈ A_n. Since q ≠ 1, we have 1 − 2^{q−1} ≠ 0, and it follows that I(− ∥ −) = cS_q(− ∥ −) where c = I((1, 0) ∥ u_2)(1 − q)/(1 − 2^{q−1}).
4 Probabilistic methods
Week VI (14 Mar)
How to use probability theory to solve functional equations.
References: Aubrun and Nechita, arXiv:1102.2618, S.R.S. Varadhan, Large
deviations.
Preview I want to make this broadly accessible, so I need to spend some time
explaining the background before I can show how to actually solve functional
equations using probability theory. We won’t reach the punchline until next
time. But to give the rough idea. . .
Functional equations have no stochastic element to them. So how could
probability theory possibly help to solve them?
Basic idea: use probability theory to replace complicated, exact formulas by
simple, approximate formulas.
Sometimes, an approximation is all you need.
E.g. (very simple example) multiply out the expression
(x + y)^1000 = (x + y)(x + y) · · · (x + y).
What terms do we obtain?
• Exact but complicated answer: every term is of the form x^i y^j with 0 ≤ i ≤ 1000 and i + j = 1000, and this term occurs 1000!/(i! j!) times.
• Simple but approximate answer: most terms are of the form x^i y^j where i and j are about 500. (Flip a fair coin 1000 times, and you'll usually get about 500 heads.)
Use probability theory to get descriptions like the second. (Different tools of probability theory allow us to give more or less specific meanings to 'most' and 'about', depending on how good an approximation we need.)
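To put a rough number on 'most' and 'about' in this example:

```python
from math import comb

total = 2 ** 1000                                    # number of terms in (x + y)^1000
near_middle = sum(comb(1000, i) for i in range(450, 551))
print(near_middle / total)                           # ≈ 0.998: nearly all terms have i within 50 of 500
```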
Aubrun and Nechita used this method to characterize the p-norms. Can also
use their method to characterize means, diversity measures, etc.
We’ll need two pieces of background: basic result on large deviations, basic
result on convex duality, then how they come together.
Let X, X_1, X_2, . . . be IID real random variables, and write X̄_r = (X_1 + · · · + X_r)/r. Idea:
P(X̄_r ≥ x) ≈ k(x)^r.
If x < E(X) then k(x) = 1. Focus on x > E(X); then k(x) < 1 and P(X̄_r ≥ x) decays exponentially with r.
Theorem 4.1 (Cramér) For each x ∈ R, the limit lim_{r→∞} P(X̄_r ≥ x)^{1/r} exists and is equal to
inf_{λ≥0} E(e^{λX})/e^{λx}.
This is a standard result. Nice short proof: Cerf and Petit, arXiv:1002.3496.
C&P state it without any kind of hypothesis on finiteness of moments; is that
correct?
Example Let c_1, . . . , c_n ∈ R and let X be uniformly distributed on {c_1, . . . , c_n}. Then:
• For x ∈ R,
P(X̄_r ≥ x) = (1/n^r) |{(i_1, . . . , i_r) : c_{i_1} + · · · + c_{i_r} ≥ rx}|.
• For λ ∈ R,
E(e^{λX}) = (e^{c_1λ} + · · · + e^{c_nλ})/n.
• So by Cramér, for x ∈ R, . . .
Convex duality
Let f : R → [−∞, ∞] be a function. Its convex conjugate or Legendre–Fenchel transform is the function f* : R → [−∞, ∞] defined by
f*(λ) = sup_{x∈R} (λx − f(x))   (λ ∈ R).
If f is differentiable, then f* describes the y-intercept of the tangent line to f as a function of its gradient:
[Graph: the graph of f(x), with a tangent line of gradient λ meeting the y-axis at (0, −f*(λ)).]
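A rough numerical sketch of the conjugate, approximating the supremum over a finite grid; for the convex function f(x) = x²/2 the conjugate is again λ²/2, and the double conjugate recovers f, illustrating the Fenchel–Moreau fact used below.

```python
def conjugate(f, grid):
    """Approximate f*(lam) = sup_x (lam*x - f(x)), maximising over a finite grid of points."""
    return lambda lam: max(lam * x - f(x) for x in grid)

grid = [i / 100 for i in range(-500, 501)]   # points in [-5, 5]
f = lambda x: x * x / 2
f_star = conjugate(f, grid)                  # f*(lam) ≈ lam^2 / 2
f_star_star = conjugate(f_star, grid)        # (f*)*(x) ≈ f(x)
print(f_star(2.0), f_star_star(1.0))         # ≈ 2.0 and ≈ 0.5
```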
Now some hand-waving. Ignoring fact that this only holds for x ≥ E(X), take
conjugate of each side. Then for all λ ≥ 0 (a restriction we need to make the
hand-waving work),
(log m_X)**(λ) = sup_{x∈R} (λx + lim_{r→∞} (1/r) log P(X̄_r ≥ x)).
It’s a general fact that log mX (called the cumulant generating function) is
convex. Hence (log mX )∗∗ = log mX by Fenchel–Moreau. So (taking exponen-
tials)
m_X(λ) = sup_{x∈R} lim_{r→∞} e^{λx} P(X̄_r ≥ x)^{1/r}.
This really is true. It’s a general formula for the moment generating function.
To make the hand-waving respectable: the full formula is
(log m_X)*(x) = lim_{r→∞} −(1/r) log P(X̄_r ≥ x)   if x ≥ E(X),
(log m_X)*(x) = lim_{r→∞} −(1/r) log P(X̄_r ≤ x)   if x ≤ E(X).
Can prove this by applying Cramér to −X and −x. Then can use it to get the
formula above for (log mX )∗∗ (λ) when λ ≥ 0.
We’ll use a variant:
Theorem 4.6 (Dual Cramér) For any IID X, X_1, X_2, . . . and all λ ≥ 0,
m_X(λ) = sup_{x∈R} sup_{r≥1} e^{λx} P(X̄_r ≥ x)^{1/r}.
This is the result we’ll use (not Cramér’s theorem itself). Now let’s apply
it.
Example 4.7 As before, let c1 , . . . , cn ∈ R and consider uniform distribution
on c1 , . . . , cn . Dual Cramér gives
e^{c_1λ} + · · · + e^{c_nλ} = sup_{x∈R, r≥1} e^{λx} |{(i_1, . . . , i_r) : c_{i_1} + · · · + c_{i_r} ≥ rx}|^{1/r}
for all λ ≥ 0.
Remember the background to all this: Aubrun and Nechita used large deviation theory to prove a unique characterization of the p-norms. Since the
left-hand side here is a sum of powers (think λ = p), we can now begin to see
the connection.
Next time: we’ll exploit this expression for sums of powers. We’ll use it
to prove a theorem that pins down what’s so special about the p-norms, and
another theorem saying what’s so special about power means.
Then you can ask: why do they attach such importance to that object, not
something slightly different? Is it the only object that enjoys the properties it
enjoys? If not, why do they use the object they use and not some other object
enjoying those properties? (Are they missing a trick?) And if it is the only
object with those properties, we should be able to prove it!
We’ll do this now for the p-norms, which are very standard in analysis.
Following theorem and proof are from Aubrun and Nechita, arXiv:1102.2618.
I believe they were the pioneers of this probabilistic method for solving functional equations. See also Tricki, 'The tensor power trick'.
Notation: for a set I (which will always be finite for us), write R^I for the set of functions I → R, i.e. of families (x_i)_{i∈I} of reals.
E.g. R^{{1,...,n}} = R^n. In some sense we might as well only use R^n, because every R^I is isomorphic to R^n. But the notation will be smoother if we allow arbitrary finite sets.
Definition 4.8 Let I be a set. A norm on R^I is a function R^I → R₊, written x ↦ ‖x‖, satisfying:
i. ‖x‖ = 0 ⟺ x = 0;
ii. ‖cx‖ = |c| ‖x‖ for all c ∈ R, x ∈ R^I;
iii. ‖x + y‖ ≤ ‖x‖ + ‖y‖.
There are many norms, but the following family comes up more often than
any other.
Example 4.9 Let I be any finite set and p ∈ [1, ∞]. The p-norm ‖·‖_p on R^I is defined for x = (x_i)_{i∈I} ∈ R^I by
‖x‖_p = (∑_{i∈I} |x_i|^p)^{1/p}   if p < ∞,
‖x‖_p = max_{i∈I} |x_i|            if p = ∞.
Aubrun and Nechita prove and use following result, which they don’t quite
make explicit:
Proposition 4.10 For p ∈ [1, ∞) and x ∈ (0, ∞)^n,
‖x‖_p = sup_{u>0, r≥1} u · |{(i_1, . . . , i_r) : x_{i_1} · · · x_{i_r} ≥ u^r}|^{1/(rp)}.
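A brute-force numerical check of this formula, truncating the supremum to small r and a finite grid of u; each truncated value is a lower bound for ‖x‖_p:

```python
from itertools import product
from math import prod

def pnorm(x, p):
    return sum(t ** p for t in x) ** (1 / p)

def truncated_sup(x, p, r_max=6, u_steps=200):
    """sup over 1 <= r <= r_max and gridded u > 0 of
    u * #{(i_1,...,i_r): x_{i_1}*...*x_{i_r} >= u^r}^(1/(r*p))."""
    best = 0.0
    for r in range(1, r_max + 1):
        for k in range(1, u_steps + 1):
            u = max(x) * k / u_steps
            count = sum(1 for tup in product(x, repeat=r) if prod(tup) >= u ** r)
            if count:
                best = max(best, u * count ** (1 / (r * p)))
    return best

x, p = [0.5, 1.2, 2.0], 3
print(pnorm(x, p), truncated_sup(x, p))  # ≈ 2.14 vs roughly 2.0; the sup rises to the p-norm as r_max grows
```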
Fix p ∈ [1, ∞]. For each finite set I, have the p-norm on R^I. These enjoy some special properties.
Special properties of the p-norms:
• We have
‖(x_3, x_2, x_1)‖_p = ‖(x_1, x_2, x_3)‖_p    (5)
(etc.) and
‖(x_1, x_2, x_3, 0)‖_p = ‖(x_1, x_2, x_3)‖_p    (6)
(etc.). Generally, any injection f : I → J induces a map f_* : R^I → R^J, given by (f_*x)_j = x_i if j = f(i) and (f_*x)_j = 0 if j is not in the image of f (x ∈ R^I, j ∈ J). Then
‖f_*x‖_p = ‖x‖_p    (7)
for all injections f : I → J and x ∈ R^I. For permutations f of {1, . . . , n}, (7) gives equations such as (5); for the inclusion {1, . . . , n} ↪ {1, . . . , n, n+1}, (7) gives equations such as (6).
• For A, B, x, y, z ∈ R, we have
‖(Ax, Ay, Az, Bx, By, Bz)‖_p = ‖(A, B)‖_p ‖(x, y, z)‖_p,
and generally ‖x ⊗ y‖_p = ‖x‖_p ‖y‖_p for x ∈ R^I, y ∈ R^J. (If you identify R^I ⊗ R^J with R^{I×J}, as you can for finite sets, then x ⊗ y means what you think.) These two properties are what define a multiplicative system of norms: a norm on R^I for each finite set I, such that ‖f_*x‖ = ‖x‖ for all injections f and ‖x ⊗ y‖ = ‖x‖ ‖y‖.
Example 4.12 For each p ∈ [1, ∞], the p-norm ‖·‖_p is a multiplicative system of norms.
Step 1: elementary results
Lemma 4.14 Let I be a finite set and x, y ∈ R^I.
i. If y_i = ±x_i for each i ∈ I then ‖x‖ = ‖y‖.
ii. If y_i = |x_i| for each i ∈ I then ‖x‖ = ‖y‖.
iii. If 0 ≤ x_i ≤ y_i for each i ∈ I then ‖x‖ ≤ ‖y‖.
Proof Omitted in class, but here it is.
i. x ⊗ (1, −1) is a permutation of y ⊗ (1, −1), so
‖x ⊗ (1, −1)‖ = ‖y ⊗ (1, −1)‖,
or equivalently
‖x‖ ‖(1, −1)‖ = ‖y‖ ‖(1, −1)‖,
and so ‖x‖ = ‖y‖.
ii. Immediate from last part.
iii. For each a ∈ {1, −1}^I, have the vector (a_iy_i)_{i∈I} ∈ R^I. Let S be the set of such vectors; so if I has n elements then S has 2^n elements. Then the convex hull of S is ∏_{i∈I} [−y_i, y_i], which contains x:
[Picture (case I = {1, 2}): the rectangle with corners (±y_1, ±y_2), including (−y_1, y_2) and y = (y_1, y_2), containing the point x = (x_1, x_2).]
But every vector in S has norm ‖y‖ (by the first part), so ‖x‖ ≤ ‖y‖ by the triangle inequality.
Step 3: the case p = ∞ If p = ∞, easy to show ‖·‖ = ‖·‖_∞. (Omitted in class, but here it is.) Let p = ∞, that is, c = 0. By Lemma 4.14, enough to prove ‖x‖ = ‖x‖_∞ for all x ∈ R^n such that x_i ≥ 0 for all i. Choose j such that x_j = ‖x‖_∞. Using Lemma 4.14(iii), . . .
Notation: for w ∈ R^J and t > 0, write N(w, t) = |{j ∈ J : w_j ≥ t}|.
For r ≥ 1, write
x^{⊗r} = x ⊗ · · · ⊗ x ∈ R^{n^r}.
Then Proposition 4.10 states that
‖x‖_p = sup_{u>0, r≥1} u · N(x^{⊗r}, u^r)^{1/(rp)},    (8)
or equivalently
‖x‖_p = sup_{u>0, r≥1} ‖(u^r, . . . , u^r)‖_p^{1/r},    (9)
where the vector (u^r, . . . , u^r) has N(x^{⊗r}, u^r) entries.
Step 5: the lower bound First show ‖x‖ ≥ ‖x‖_p. By (9), enough to show that for each u > 0 and r ≥ 1,
‖x‖ ≥ ‖(u^r, . . . , u^r)‖^{1/r},
where the vector has N(x^{⊗r}, u^r) entries; equivalently (since ‖x^{⊗r}‖ = ‖x‖^r by multiplicativity),
‖x^{⊗r}‖ ≥ ‖(u^r, . . . , u^r)‖.
Now x^{⊗r} has n^r entries, of which N(x^{⊗r}, u^r) are ≥ u^r, so by Lemma 4.14(iii) and (7),
‖x^{⊗r}‖ ≥ ‖(u^r, . . . , u^r, 0, . . . , 0)‖ = ‖(u^r, . . . , u^r)‖,
where the first vector has n^r entries of which N(x^{⊗r}, u^r) equal u^r, and the second has N(x^{⊗r}, u^r) entries.
Step 6: the upper bound Now prove ‖x‖ ≤ θ‖x‖_p for each θ > 1. Since min_i x_i > 0, can choose u_0, . . . , u_d such that . . .
‖x^{⊗r}‖ ≤ ‖y_r‖ = ‖(u_1^r, . . . , u_1^r, . . . , u_d^r, . . . , u_d^r)‖,
so
‖x^{⊗r}‖ ≤ ∑_{k=1}^d ‖(u_k^r, . . . , u_k^r)‖    (at most N(x^{⊗r}, u_{k−1}^r) terms in the kth norm)
         ≤ d · max_{1≤k≤d} ‖(u_k^r, . . . , u_k^r)‖    (at most N(x^{⊗r}, u_{k−1}^r) terms)
by (9). Hence ‖x‖ = ‖x^{⊗r}‖^{1/r} ≤ d^{1/r} θ ‖x‖_p. Letting r → ∞, get ‖x‖ ≤ θ‖x‖_p.
True for all θ > 1, so ‖x‖ ≤ ‖x‖_p, completing the proof of Theorem 4.13.
Next week: how to measure biological diversity, and how diversity measures
are related to norms, means and entropy.
E.g. consider two communities, A and B. Which is more diverse?
Definition 5.1 Let q ∈ [0, ∞]. The Hill number (or diversity) of order q
of p is defined for q ≠ 1, ∞ by
D_q(p) = (∑_{i : p_i > 0} p_i^q)^{1/(1−q)}.
Special cases:
• D0 (p) is species richness, i.e. number of species present. Most common
meaning of ‘diversity’ in both general media and ecological literature.
• D1 (p) = exp(H1 (p)). Doesn’t depend on choice of base for logarithm.
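A small Python sketch of the Hill numbers, with the q = 1 case taken to be exp H(p) as in the special case above:

```python
from math import exp, log

def hill(p, q):
    """Hill number D_q(p) = (sum_{i: p_i > 0} p_i^q)^(1/(1-q)) for q != 1, and exp H(p) at q = 1."""
    pos = [x for x in p if x > 0]
    if q == 1:
        return exp(-sum(x * log(x) for x in pos))
    return sum(x ** q for x in pos) ** (1 / (1 - q))

u4 = [0.25] * 4
print([hill(u4, q) for q in (0, 1, 2)])        # [4.0, 4.0, 4.0]: effective-number property
print(hill([0.7, 0.1, 0.1, 0.1], 0))           # 4: species richness ignores abundances
print(hill([0.7, 0.1, 0.1, 0.1], 2))           # ≈ 1.92: rare species count for much less
```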
For communities A and B above:
[Graph: D_q(p) plotted against q for communities A and B.]
Always decreasing and continuous in q, for reasons we'll come back to.
Properties of Hill numbers: let q ∈ [0, ∞].
• Dq is an effective number: writing un = (1/n, . . . , 1/n),
Dq (un ) = n.
Proposition 5.3 Let p ∈ ∆n and x ∈ (R+ )n . Then Mt (p, x) is (non-strictly)
increasing and continuous in t ∈ [−∞, ∞].
Proof See e.g. Hardy, Littlewood and Pólya’s book Inequalities, Theorem 16.
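The formula for M_t is not reproduced in the surviving text above; under the standard definition M_t(p, x) = (∑_{i : p_i > 0} p_i x_i^t)^{1/t} for t ≠ 0, ±∞ (geometric mean at t = 0, min and max at t = ∓∞) — which is consistent with the recap D_q(p) = M_{1−q}(p, 1/p) given later — the monotonicity in Proposition 5.3 can be seen numerically:

```python
from math import exp, log, inf

def power_mean(p, x, t):
    """Power mean M_t(p, x) under the assumed formula; only indices with p_i > 0 contribute."""
    pairs = [(pi, xi) for pi, xi in zip(p, x) if pi > 0]
    if t == inf:
        return max(xi for _, xi in pairs)
    if t == -inf:
        return min(xi for _, xi in pairs)
    if t == 0:
        return exp(sum(pi * log(xi) for pi, xi in pairs))
    return sum(pi * xi ** t for pi, xi in pairs) ** (1 / t)

p, x = [0.5, 0.25, 0.25], [1.0, 2.0, 4.0]
print([round(power_mean(p, x, t), 3) for t in (-inf, -1, 0, 1, 2, inf)])
# [1.0, 1.455, 1.682, 2.0, 2.345, 4.0] -- increasing in t, as in Proposition 5.3
```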
Rényi entropies
Definition 5.5 Let p ∈ ∆n and q ∈ [0, ∞]. The Rényi entropy of p of
order q is
Hq (p) = log Dq (p).
Hq (p ⊗ r) = Hq (p) + Hq (r)
Back to diversity Important feature of Dq is modularity, as follows. Take
n islands of varying sizes, sharing no species. Then the diversity of their union
depends only on the diversities and relative sizes of the islands—not on their
internal structure.
That is: write w = (w1 , . . . , wn ) ∈ ∆n for the relative sizes of the islands.
Write p^i = (p^i_1, . . . , p^i_{k_i}) for the relative abundances of the species on island i.
Then the species distribution for the whole community is
w ◦ (p^1, . . . , p^n).
(e^{−βE_1}, . . . , e^{−βE_n}) / Z(β)
It's a theorem that the Hill numbers are the only diversity measures satisfying the chain rule (10), at least for q ≠ 0, 1, ∞. Maybe not very useful, as why would anyone ask for this chain rule? Anyway, here's the result:
Theorem 5.7 Let q ∈ (0, 1) ∪ (1, ∞). Let (D : ∆_n → [1, ∞))_{n≥1} be a sequence of functions. TFAE:
i. D is continuous, symmetric, and satisfies the q-chain rule (10);
ii. D = D_q or D ≡ 1.
Proof Omitted. As far as I know, this result doesn’t exist in the literature. I’d
be happy to explain the proof to anyone interested.
Next week: we incorporate similarity between species into our model. This
not only improves biological accuracy; it also leads to interesting new mathematics.
Week IX (4 Apr)
Recap: crude model of community as probability distribution; definition
of Mt (p, x), of Dq (p) as M1−q (p, 1/p) (diversity as average rarity, with 1/p
defined coordinatewise) and explicit form, and Hq (p) = log Dq (p).
Similarity-sensitive diversity
Ref: Leinster and Cobbold, Ecology, 2012.
Model: community of n species, with:
• relative abundances p = (p1 , . . . , pn ) ∈ ∆n ;
• for each i, j, a similarity coefficient Zij ∈ [0, 1], giving matrix Z = (Zij ).
Assume Zii = 1 for all i.
Examples 5.8 i. Z = I: then different species have nothing in common
(naive model).
ii. Zij = percentage genetic similarity between species i and j.
iii. Or measure similarity phylogenetically, taxonomically, morphologically,
...
iv. Given a metric d on {1, . . . , n}, put Z_ij = e^{−d(i,j)}.
v. (Non-biological?) Given a reflexive graph with vertices 1, . . . , n, put
Z_ij = 1 if there is an edge between i and j, and Z_ij = 0 otherwise.
Reflexive means that there is an edge from each vertex to itself.
Viewing p as a column vector, we have
(Zp)_i = ∑_j Z_ij p_j.
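The definition of D_q^Z itself is not in the surviving text at this point, but it is recapped later as D_q^Z(p) = M_{1−q}(p, 1/Zp); here is a sketch on that basis, for q ≠ 1, ±∞:

```python
def diversity_Z(p, Z, q):
    """Similarity-sensitive diversity, assuming D_q^Z(p) = M_{1-q}(p, 1/Zp); q != 1, ±inf."""
    Zp = [sum(Z[i][j] * p[j] for j in range(len(p))) for i in range(len(p))]
    t = 1 - q
    return sum(p[i] * Zp[i] ** (q - 1) for i in range(len(p)) if p[i] > 0) ** (1 / t)

# Naive model Z = I recovers the Hill number: D_2^I(p) = 1 / sum_i p_i^2.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
p = [0.5, 0.25, 0.25]
print(diversity_Z(p, I3, 2))               # ≈ 2.667

# A similarity matrix with two very similar species lowers the diversity.
Z = [[1, 0.9, 0], [0.9, 1, 0], [0, 0, 1]]
print(diversity_Z(p, Z, 2))                # ≈ 1.667, smaller than 2.667
```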
Examples 5.10 i. Naive model Z = I: have DqI (p) = Dq (p). So the
classical quantities Dq (p)—the exponentials of Shannon entropy etc.—
have a big problem: they implicitly assume that different species have
nothing in common. This is of course nonsense.
ii. D_2^Z(p) = 1/∑_{i,j} p_i Z_ij p_j. This is the reciprocal of the expected similarity between a randomly-chosen pair of individuals. You can measure this even in situations (e.g. microbial) where you don't have a species classification at all. Something similar can be done for all integers q ≥ 2.
iii. Generally: incorporating similarity can change judgement on when one
community is more diverse than another. See our paper for examples.
Broad idea: it may be that one community initially looks more diverse,
but all the diversity is within a group of species that are highly similar,
in which case it’s really not so diverse after all.
Properties of DqZ , for each fixed q (and here I’ll skip some obvious ones):
• Splitting a species into two identical species doesn’t change the diversity.
Mathematically: add a new column to the matrix identical to the last one,
and a new row identical to the last one. Also split pn as a sum of two
probabilities however you like.
Consequence: by continuity, splitting a species into two highly similar
species causes only a slight increase in diversity. This is sensible, logical
behaviour. Species classifications can be somewhat arbitrary and changeable.
• Modularity: take m islands, such that species on different islands are
distinct and totally dissimilar. Then the diversity of the whole depends
only on the diversities d_i and relative sizes w_i of the islands. ('Relative' means ∑ w_i = 1.) The important thing here is the functional relationship just stated. But I'll also give the actual formula. Explicitly, the diversity of the whole is
(∑_{i : w_i > 0} w_i^q d_i^{1−q})^{1/(1−q)}.    (11)
Digression: entropy viewpoint Define
Maximizing diversity Suppose we fix a list of species and can control their
relative abundances. What relative abundance distribution would we choose
in order to maximize the diversity, and what is the value of that maximum
diversity?
Or more mathematically: fix a finite metric space. Which probability dis-
tribution(s) maximize the diversity (or equivalently entropy), and what is the
value of the maximum diversity?
Fix q and a similarity matrix Z. Questions:
A Which distribution(s) p maximize DqZ (p)?
B What is the value of the maximum diversity, sup_p D_q^Z(p)?
Recall that even in the case Z = I, different values of q give different judge-
ments on which communities are more diverse than which others. They rank
distributions p differently. So: In principle, the answers to both A and B
depend on q. But:
Theorem 5.14 (with Mark Meckes) Let Z be a symmetric similarity ma-
trix. Then:
A There is a distribution p that maximizes DqZ (p) for all q ∈ [0, ∞] simul-
taneously.
B sup_p D_q^Z(p) is independent of q ∈ [0, ∞].
So, there is a best of all possible worlds.
Remarks 5.15 i. For Z = I, trivial: example above.
vi. Any distribution p that maximizes DqZ (p) for one q > 0 actually max-
imizes DqZ (p) for all q ∈ [0, ∞]. (Not obvious, but in the paper with
Meckes.)
Next time: I’ll describe a more general family of measures and prove that in
a certain sense, it is the only sensible way of assigning a value to a collection of
things. I don’t presuppose that this ‘value’ is any kind of diversity, nor that there
is anything ecological about the setting. But it turns out that it’s closely related
to the diversity measures that we just met, and that it’s especially interesting
to interpret these ‘value’ measures in an ecological setting.
Value
Recap: power means, D_q(p) = M_{1−q}(p, 1/p), similarity matrices, D_q^Z(p) = M_{1−q}(p, 1/Zp), modularity formula (∑_i w_i^q d_i^{1−q})^{1/(1−q)}.
Today we’ll look at a very general notion of the ‘value’ of a collection of
objects. We’ll see:
Set-up Consider a finite set {1, . . . , n} together with:
• for each element i, a probability pi ∈ [0, 1]
• for each element i, a ‘value’ vi ∈ (0, ∞).
[Picture: a three-element set with probabilities 1/2, 1/4, 1/4 and values v_1, v_2, v_3.]
Probabilities as proportions.
Examples 5.16 i. Elements are species in some community, pi = relative
abundance of ith species in community, vi = 1 for all i.
ii. Can also imagine assigning monetary value to each species, e.g. revenue
from ecotourism or whatever.
iii. Same elements and same p, but (given a similarity matrix Z) put
v_i = p_i/(Zp)_i = p_i/(p_i + ∑_{j : j≠i} Z_ij p_j) ≤ 1.
Basic question Given proportions and values of individual elements, how can
we assign a value to the whole set? What is the value of the whole in terms of
the values of the parts?
To answer this question is to give a sequence of functions (σ : ∆_n × (0, ∞)^n → (0, ∞))_{n≥1}, defined by . . .
(p ∈ ∆n , v ∈ (0, ∞)n ), where the sum, product, max and min are over all i
such that pi > 0. (If pi = 0 then we appear to have the problem of vi /pi being
undefined, but it’s OK: the definition of M1−q only involves those i such that
pi > 0.)
Example 5.17 The case q = 0:
σ_0(p, v) = ∑_{i : p_i > 0} v_i.
Just add up the values. It’s the most obvious way to aggregate the individual
values into a single overall value, but not the only way! And it ignores the
probabilities, except for asking whether or not pi = 0.
iii. Take community divided into n subcommunities, such that species in dif-
ferent subcommunities are completely dissimilar. Let
Then
DqZ (whole community) = σq (p, v).
That’s exactly the earlier modularity formula (11).
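The general formula for σ_q did not survive above, but the case q = 0 (Example 5.17), the modularity formula (11) and the power-mean pattern D_q(p) = M_{1−q}(p, 1/p) all fit the form σ_q(p, v) = M_{1−q}(p, v/p) = (∑_{i : p_i > 0} p_i (v_i/p_i)^{1−q})^{1/(1−q)}; a sketch under that assumption, for q ≠ 1, ±∞:

```python
def value(p, v, q):
    """Assumed form: sigma_q(p, v) = (sum_{i: p_i > 0} p_i * (v_i / p_i)^(1-q))^(1/(1-q)), q != 1, ±inf."""
    t = 1 - q
    return sum(pi * (vi / pi) ** t for pi, vi in zip(p, v) if pi > 0) ** (1 / t)

p = [0.5, 0.25, 0.25]
print(value(p, [1.0, 1.0, 1.0], 0))    # 3.0: at q = 0, just the sum of the values
print(value(p, [1.0, 1.0, 1.0], 2))    # ≈ 2.667: with all values 1, this is D_2(p)
print(value(p, [2.0, 2.0, 2.0], 2))    # ≈ 5.333: homogeneity -- doubling values doubles sigma
```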
Properties The following properties are satisfied by σ = σq for each q ∈
[−∞, ∞]. Start with most significant.
i. Homogeneity: σ(p, cv) = cσ(p, v) for all c > 0. Interpretation: σ(p, v)
is measured in same units as v1 , . . . , vn . E.g. if vi are in kg then so is
σ(p, v). A switch to grams would multiply both by 1000.
ii. Increasing: if v_i ≤ v_i′ for all i then σ(p, v) ≤ σ(p, v′). Idea: the values of the parts make a positive contribution to the value of the whole.
iii. Effective number: σ(u_n, (1, . . . , 1)) = n for all n ≥ 1. Assuming homogeneity, an equivalent statement is σ(u_n, (v, . . . , v)) = nv for all n and v. Interpretation: if n equal-sized elements each have value v then the value of the whole set is nv.
iv. Modularity: Suppose each element of our set is itself divided into 'sub-elements', each with an assigned probability and value:
[Picture: two elements with probabilities 2/3 and 1/3, each divided into sub-elements with their own probabilities (2/5, 1/5, 1/8, . . .) and values.]
And now a big result:
Theorem 5.20 Let (σ : ∆_n × (0, ∞)^n → (0, ∞))_{n≥1} be a sequence of functions. TFAE:
• σ satisfies properties (i)–(vii) above;
• σ = σ_q for some q ∈ [−∞, ∞].
Proof is long: about 15 pages. But not so bad. If I’d known the proof when
I started the course, I’d have chosen topics differently so that by the time we
reached this point, we’d have done all the necessary preparatory results.
Outline of proof:
• Functional equation characterizing q-logarithms (for unspecified q). I.e. a
theorem saying that a function f satisfies certain properties if and only if
f = lnq for some q ∈ R.
• Classical theory of 'quasiarithmetic' means, i.e. those of the form
mean(x_1, . . . , x_n) = φ^{−1}((1/n) ∑_{i=1}^n φ(x_i))
• Deduce that σ(p, v) = σq (p, v) for all p, using further assumed properties
of σ. (An intermediate step here is the case when p is rational, much as
in the proof of Faddeev’s theorem on Shannon entropy).
6 The diversity of a metacommunity
Week XI (18 Apr)
Previously, we looked at how to quantify the diversity of a single community,
maybe taking into account the varying similarities between species. We saw
that this was closely related to entropy, and called it the ‘effective number of
species’. A diversity of 8.2 means that your community is slightly more diverse
than a community of eight equally abundant, completely dissimilar species.
General context: a metacommunity (large ecological community) divided
into communities (e.g. geographical regions). Traditional terminology:
• α-diversity: average diversity of the individual communities.
• β-diversity: diversity between the communities, i.e. how much they vary
from each other.
• γ-diversity: diversity of the whole metacommunity (global).
These terms are imprecise; many specific proposals have been made for defining them. It's not clear that you can sensibly partition diversity into within-community/between-community components like this. Probably the initial efforts to do this were inspired by ANOVA (analysis of variance, in statistics);
variance and diversity are both measures of variation or spread. But then the
notions of α-, β- and γ-diversity acquired their own independent standing.
This week: largely based on Reeve et al. (arXiv:1404.6520), which provides
non-traditional and highly flexible, sensitive approach to metacommunity diversity. It introduces 14 measures. They have different meanings for conservation
etc. We’ll consider about half of them. It allows for varying q (i.e. varying
emphasis on rare/common species or communities) and varying Z (i.e. varying
similarities between species). But to keep things simple, we’ll work entirely in
the case Z = I (distinct species assumed to be completely dissimilar) and q = 1
(the best-understood value, corresponding to Shannon rather than Rényi etc.
entropies).
Information-theoretically: there are various entropy measures for a pair of
distributions (relative, cross, joint, conditional, mutual information, . . . ). Crucial distinction: are the distributions on the same set or different sets?
I’ll explain these various entropy measures both abstractly and in terms of
biological diversity.
Today is a sort of overview of various aspects of entropy and diversity, leading
beyond the scope of this course.
• p be the species distribution for the metacommunity;
• r be the species distribution for some smaller community.
We're already used to the idea that diversity is the exponential of entropy. So let's consider
exp H(r ∥ p) = ∏_i (r_i/p_i)^{r_i}.
This is called β in Reeve et al., and measures how unusual or atypical the community is relative to the metacommunity:
• Maximum value is 1/w, where w ∈ [0, 1] is the size of the community
relative to the metacommunity. (To prove this: pi ≥ wri , so ri /pi ≤ 1/w.)
Attained when the species in the community occur nowhere else in the
metacommunity.
• Minimum value is 1, attained when r = p. (Then the community is an
absolutely typical sample of the metacommunity.)
Given random variables X and Y taking values in finite sets Ξ and Ω respectively:
• H(X) = ∑_{x∈Ξ} P(x) log(1/P(x)), the entropy of X;
• H(Y), similarly;
• H(X, Y) = ∑_{x∈Ξ, y∈Ω} P(x, y) log(1/P(x, y)), the joint entropy of X and Y. (Not really a new definition; it's just the entropy of the random variable (X, Y).)
Highly suggestive diagram:
[Venn diagram: two overlapping regions labelled H(X) and H(Y), whose union is H(X, Y).]
The appearance of the conditional probability here begins to suggest why
it’s called conditional entropy. A stronger reason is that
H(X | Y) = ∑_y P(y) H(X | Y = y)
• H(Y | X) similarly.
• Mutual information: I(X; Y) = H(X) − H(X | Y).
From this point of view, it’s nontrivial that I(X; Y ) is symmetric in X and Y .
All these quantities are ≥ 0, with I(X; Y ) = 0 iff X and Y are independent.
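A compact Python sketch computing all of these quantities from a joint distribution matrix P, using the standard identities H(X | Y) = H(X, Y) − H(Y) and I(X; Y) = H(X) + H(Y) − H(X, Y):

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def information_measures(P):
    """Given a joint distribution P[x][y], return (H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), I(X;Y))."""
    px = [sum(row) for row in P]                 # marginal of X
    py = [sum(col) for col in zip(*P)]           # marginal of Y
    HX, HY = H(px), H(py)
    HXY = H([p for row in P for p in row])
    return HX, HY, HXY, HXY - HY, HXY - HX, HX + HY - HXY

# Independent marginals: I(X;Y) = 0 and H(X,Y) = H(X) + H(Y).
print(information_measures([[0.25, 0.25], [0.25, 0.25]]))
# Perfectly correlated: I(X;Y) = H(X) = H(Y) = 1 bit.
print(information_measures([[0.5, 0.0], [0.0, 0.5]]))
```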
The Venn diagram isn’t just an analogy; it refers to a particular example:
Example 6.1 Let U and V be finite subsets of some set. Choose a subset Z
of U ∪ V uniformly at random, and put
X = Z ∩ U, Y = Z ∩ V.
[Venn diagram: sets U and V, with regions U \ V, U ∩ V and V \ U inside U ∪ V.]
I'll always use i for species and j for communities. So ∑_{i,j} P_ij = 1, and we have the S × N matrix P = (P_ij), with rows indexed by species (i = 1, . . . , S) and columns indexed by communities (j = 1, . . . , N).
Then the overall species distribution is p = (p_1, . . . , p_S), where p_i = ∑_j P_ij, and the community size distribution is w = (w_1, . . . , w_N), where w_j = ∑_i P_ij.
Now let X be the random variable for species (with distribution p) and Y the RV for communities (with distribution w), so that the joint distribution is P.
Now let's go through the Venn diagram (table below), again following the pattern that when we're interested in diversity rather than entropy, we take an exponential. Summary:
[Venn diagram labelled with G = D_1(p), D_1(w), A, B and R.]
Quantity; name in Reeve et al.; formula; description; range; maximized when; minimized when:

• exp H(X). Name: G = D_1(p). Formula: ∏_i p_i^{−p_i}. Effective number of species in the metacommunity (ignoring the division into communities). Range [1, S]. Maximized when the species in the metacommunity are balanced; minimized when only one species is present.

• exp H(Y). Name: D_1(w). Formula: ∏_j w_j^{−w_j}. Effective number of communities (ignoring the division into species). Range [1, N]. Maximized when the communities are of equal size; minimized when there is only one community.

• exp H(X, Y). Name: A. Formula: ∏_{i,j} P_ij^{−P_ij}. Effective number of (species, community) pairs. Range [1, NS]. Maximized when the communities are all balanced and of equal size; minimized when there is a single species and a single community.

• exp H(X | Y). Name: Ā. Formula: ∏_{i,j} (P_ij/w_j)^{−P_ij}. Average effective number of species per community. Range [1, S]. Maximized when the communities are all balanced; minimized when each community has only one species.

• exp H(Y | X). Name: R. Formula: A/G. Redundancy of communities. Range [1, N]. Maximized when the communities have the same size and the same composition; minimized when the communities share no species.

• exp I(X; Y). Name: B. Formula: G/Ā. Effective number of distinct communities. Range [1, N]. Maximized when the communities share no species and are of the same size; minimized when the communities have the same composition.
Enormous thanks to all the participants in this seminar course for your questions, contributions and encouragement.