Solutions
Solutions
Solutions
This is the solutions manual for Information Theory, Inference, and Learn-
ing Algorithms. It is supplied on request to instructors using this book in
their teaching; please email [email protected]. For the benet of
instructors, please do not circulate this document to students.
c _2003 David J.C. MacKay. Version 6.2 November 12, 2003.
Please send corrections or additions to these solutions to David MacKay,
[email protected].
Reminder about internet resources
The website
http://www.inference.phy.cam.ac.uk/mackay/itila
contains several resources for this book.
Extra Solutions for Chapter 1
Solution to exercise 1.4 (p.12). The matrix HG
T
mod2 is equal to the all-zero
3 4 matrix, so for any codeword t = G
T
s, Ht = HG
T
s = (0, 0, 0)
T
.
Solution to exercise 1.5 (p.13). (a) 1100 (b) 0100 (c) 0100 (d) 1111.
Solution to exercise 1.8 (p.13). To be a valid hypothesis, a decoded pattern
must be a codeword of the code. If there were a decoded pattern in which the
parity bits diered from the transmitted parity bits, but the source bits didnt
dier, that would mean that there are two codewords with the same source
bits but dierent parity bits. But since the parity bits are a deterministic
function of the source bits, this is a contradiction.
So if any linear code is decoded with its optimal decoder, and a decoding
error occurs anywhere in the block, some of the source bits must be in error.
Extra Solutions for Chapter 2
Solution to exercise 2.8 (p.30). Tips for Sketching the posteriors: best tech-
nique for sketching p
29
(1 p)
271
is to sketch the logarithm of the posterior,
dierentiating to nd where its maximum is. Take the second derivative at
the maximum in order to approximate the peak as exp[(p p
MAP
)
2
/2s
2
]
and nd the width s.
Assuming the uniform prior (which of course is not fundamentally right
in any sense, indeed it doesnt look very uniform in other bases, such as the
logit basis), the probability that the next outcome is a head is
n
H
+ 1
N + 2
(D.1)
(a) N = 3 and n
H
= 0:
1
5
;
701
702 Solutions manual
(b) N = 3 and n
H
= 2:
3
5
;
(c) N = 10 and n
H
= 3:
4
12
;
(d) N = 300 and n
H
= 29:
30
302
.
Solution to exercise 2.27 (p.37). Dene, for each i > 1, p
i
= p
i
/(1 p
1
).
H(p) = p
1
log 1/p
1
+
i>1
p
i
log 1/p
i
(D.2)
= p
1
log 1/p
1
+ (1 p
1
)
i>1
p
i
[log 1/(1 p
1
) + log 1/p
i
] (D.3)
= p
1
log 1/p
1
+ (1p
1
) log 1/(1p
1
) + (1p
1
)
i>1
p
i
[log 1/p
i
] (D.4)
Similar approach for the more general formula.
Solution to exercise 2.28 (p.38). P(0) = fg; P(1) = f(1g); P(2) = (1f)h;
P(3) = (1f)(1h); H(X) = H
2
(f) +fH
2
(g) +(1f)H
2
(h). dH(X)/df =
log[(1 f)/f] +H
2
(g) H
2
(h).
Solution to exercise 2.29 (p.38). Direct solution: H(X) =
i
p
i
log 1/p
i
=
i=1
(1/2
i
)i = 2. [The nal step, summing the series, requires mathemat-
ical skill, or a computer algebra system; one strategy is to dene Z() =
i=1
(1/2
i
), a series that is easier to sum (its Z = 1/(2
x,y
P(x, y)h(x, y) =
x,y
P(x, y)(h(x) +h(y)) (D.13)
=
_
x,y
P(x, y)h(x)
_
+
_
x,y
P(x, y)h(y)
_
. (D.14)
Because h(x) has no dependence on y, its easy to sum over y in the rst term.
y
P(x, y) = P(x). Summing over x in the second term similarly, we have
H(X, Y ) =
x
P(x)h(x) +
y
P(y)h(y) = H(X) +H(Y ).
Solution to exercise 4.9 (p.84). If six are weighed against six, then the rst
weighing conveys no information about the question which is the odd ball?
All 12 balls are equally likely, both before and after.
If six are weighed against six, then the rst weighing conveys exactly one
bit of information about the question which is the odd ball and is it heavy or
light? There are 24 viable hypotheses before, all equally likely; and after, there
are 12. A halving of the number of (equiprobable) possibilities corresponds to
gaining one bit. (Think of playing sixty-three.)
Solution to exercise 4.10 (p.84). Lets use our rule of thumb: always maximize
the entropy. At the rst step we weigh 13 against 13, since that maximizes the
entropy of the outcome. If they balance, we weigh 5 of the remainder against 4
of the remainder (plus one good ball). The outcomes have probabilities 8/26
(balance), 9/26, and 9/26, which is the most uniform distribution possible.
Solutions manual 705
Lets imagine that the 5 are heavier than the 4 plus 1. We now ensure that
the next weighing has probability 1/3 for each outcome: leave out any three of
the nine suspects, and allocate the others appropriately. For example, leaving
out HHH, weigh HLL against HLL, where H denotes a possibly heavy ball and
L a possibly light one. Then if those balance, weigh an omitted pair of Hs; if
they do not balance, weigh the two Ls against each other.
John Conways solution on page 86 of the book gives an explicit and more
general solution.
Solution to exercise 4.11 (p.84). Going by the rule of thumb that the most
ecient strategy is the most informative strategy, in the sense of having all
possible outcomes as near as possible to equiprobable, we want the rst weigh-
ing to have outcomes the two sides balance in eight cases and the two sides
do not balance in eight cases. This is achieved by initially weighing 1,2,3,4
against 5,6,7,8, leaving the other eight balls aside. Iterating this binary divi-
sion of the possibilities, we arrive at a strategy requiring 4 weighings.
The above strategy for designing a sequence of binary experiments by
constructing a binary tree from the top down is actually not always optimal;
the optimal method of constructing a binary tree will be explained in the next
chapter.
Solution to exercise 4.12 (p.84). The weights needed are 1, 3, 9, and 27. Four
weights in total. The set of 81 integers from 40 to +40 can be represented in
ternary, with the three symbols being interpreted as weight on left, weight
omitted, and weight on right.
Solution to exercise 4.14 (p.84).
(a) A sloppy answer to this question counts the number of possible states,
_
12
2
_
2
2
= 264, and takes its base 3 logarithm, which is 5.07, which exceeds
5. We might estimate that six weighings suce to nd the state of the
two odd balls among 12. If there are three odd balls then there are
_
12
3
_
2
3
= 1760 states, whose logarithm is 6.80, so seven weighings might
be estimated to suce.
However, these answers neglect the possibility that we will learn some-
thing more from our experiments than just which are the odd balls. Let
us dene the oddness of an odd ball to be the absolute value of the
dierence between its weight and the regular weight. There is a good
chance that we will also learn something about the relative oddnesses of
the two odd balls. If balls m and n are the odd balls, there is a good
chance that the optimal weighing strategy will at some point put ball m
on one side of the balance and ball n on the other, along with a load of
regular balls; if m and n are both heavy balls, say, the outcome of this
weighing will reveal, at the end of the day, whether m was heavier than
n, or lighter, or the same, which is not something we were asked to nd
out. From the point of view of the task, nding the relative oddnesses
of the two balls is a waste of experimental capacity.
A more careful estimate takes this annoying possibility into account.
In the case of two odd balls, a complete description of the balls, includ-
ing a ranking of their oddnesses, has three times as many states as we
counted above (the two odd balls could be odd by the same amount, or
by amounts that dier), i.e., 264 3 = 792 outcomes, whose logarithm
is 6.07. Thus to identify the full state of the system in 6 weighings is
706 Solutions manual
impossible at least seven are needed. I dont know whether the original
problem can be solved in 6 weighings.
In the case of three odd balls, there are 3! = 6 possible rankings of the
oddnesses if the oddnesses are dierent (e.g., 0 < A < B < C), six if
two of them are equal (e.g., 0 < A < B = C and 0 < A = B < C), and
just one if they are equal (0 < A = B = C). So we have to multiply
the sloppy answer by 13. We thus nd that the number of full system
states is 13 1760, whose logarithm is 9.13. So at least ten weighings
are needed to guarantee identication of the full state. I can believe that
nine weighings might suce to solve the required problem, but it is not
clear.
(b) If the weights of heavy, regular and light balls are known in advance,
the original sloppy method becomes correct. At least six weighings are
needed to guarantee identication of the two-odd-out-of-twelve, and at
least seven to identify three out of twelve.
Solution to exercise 4.16 (p.85). The curves
1
N
H
(Y
N
) as a function of for
N = 1, 2, 3 and 100 are shown in gure D.1. Note that H
2
(0.5) = 1 bit.
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
N=1
N=2
N=3
N=100
N = 2
1
N
H
(Y) 2
H
(Y)
00.25 1 4
0.250.5 0.79248 3
0.50.75 0.5 2
0.751 0 1
N = 3
1
N
H
(Y) 2
H
(Y)
00.125 1 8
0.1250.25 0.93578 7
0.250.375 0.86165 6
0.3750.5 0.77398 5
0.50.625 0.66667 4
0.6250.75 0.52832 3
0.750.875 0.33333 2
0.8751 0 1
Figure D.1.
1
N
H
(Y) (vertical
axis) against (horizontal), for
N = 1, 2, 3, 100 binary variables
with p
1
= 0.5.
Solution to exercise 4.19 (p.85). Cherno bound. Let t = exp(sx) and =
exp(sa). If we assume s > 0 then x a implies t .
Assuming s > 0, P(x a) = P(t )
t/ =
x
P(x) exp(sx)/ exp(sa) =
e
sa
g(s).
Changing the sign of s means that instead x a implies t ; so assuming
s < 0, P(x a) = P(t ); the remainder of the calculation is as above.
Extra Solutions for Chapter 5
Solution to exercise 5.19 (p.102). The code 00, 11, 0101, 111, 1010, 100100,
0110 is not uniquely decodeable because 11111 can be realized from c(2)c(4)
and c(4)c(2).
Solution to exercise 5.20 (p.102). The ternary code 00, 012, 0110, 0112, 100, 201, 212, 22
is uniquely decodeable because it is a prex code.
Solution to exercise 5.23 (p.102). Probability vectors leading to a free choice
in the Human coding algorithm satisfy p
1
p
2
p
3
p
4
0 and
p
1
= p
3
+p
4
. (D.15)
Solutions manual 707
The convex hull of Q is most easily obtained by turning two of the three
inequalities p
1
p
2
p
3
p
4
into equalities, and then solving equation (D.15)
for p. Each choice of equalities gives rise to one of the set of three vectors
1/
3,
1/
3,
1/
6,
1/
6,
2/
5,
1/
5,
1/
5,
1/
5 and
1/
3,
1/
3,
1/
3, 0. (D.16)
Solution to exercise 5.24 (p.103). An optimal strategy asks questions that
have a 50:50 chance of being answered yes or no. An essay on this topic
should discuss practical ways of approaching this ideal.
Solution to exercise 5.25 (p.103). Lets work out the optimal codelengths.
They are all integers. Now, the question is, can a set of integers satisfying
the Kraft equality be arranged in an appropriate binary tree? We can do this
constructively by going to the codeword supermarket and buying the shortest
codewords rst. Having bought them in order, they must dene a binary tree.
Solution to exercise 5.27 (p.103).
a
i
p
i
log
2
1
p
i
l
i
c(a
i
)
a 0.09091 3.5 4 0000
b 0.09091 3.5 4 0001
c 0.09091 3.5 4 0100
d 0.09091 3.5 4 0101
e 0.09091 3.5 4 0110
f 0.09091 3.5 4 0111
g 0.09091 3.5 3 100
h 0.09091 3.5 3 101
i 0.09091 3.5 3 110
j 0.09091 3.5 3 111
k 0.09091 3.5 3 001
The entropy is log
2
11 = 3.4594 and the expected length is L = 3
5
11
+
4
6
11
which is 3
6
11
= 3.54545.
Solution to exercise 5.28 (p.103). The key steps in this exercise are all spelled
out in the problem statement. Diculties arise with these concepts: (1) When
you run the Human algorithm, all these equiprobable symbols will end up
having one of just two lengths, l
+
= log
2
I| and l
= log
2
I|. The steps up
to (5.32) then involve working out how many have each of these two adjacent
lengths, which depends on how close I is to a power of 2. (2) The excess length
was only dened for integer I, but we are free to nd the maximum value is
attains for any real I; this maximum will certainly not be exceeded by any
integer I.
Solution to exercise 5.29 (p.103). The sparse source T
X
= 0.99, 0.01 could
be compressed with a Human code based on blocks of length N, but N would
need to be quite large for the code to be ecient. The probability of the all-0
sequence of length N has to be reduced to about 0.5 or smaller for the code
to be ecient. This sets N log 0.5/ log 0.99 = 69. The Human code would
then have 2
69
entries in its tree, which probably exceeds the memory capacity
of all the computers in this universe and several others.
There are other ways that we could describe the data stream. One is run-
length encoding. We could chop the source into the substrings 1, 01, 001, 0001, 00001, . . .
with the last elements in the set being, say, two strings of equal maximum
length 00 . . . 01 and 00 . . . 00. We can give names to each of these strings and
708 Solutions manual
compute their probabilities, which are not hugely dissimilar to each other.
This list of probabilities starts 0.01, 0.0099, 0.009801, . . .. For this code to
be ecient, the string with largest probability should have probability about
0.5 or smaller; this means that we would make a code out of about 69 such
strings. It is perfectly feasible to make such a code. The only diculty with
this code is the issue of termination. If a sparse le ends with a string of 20
0s still left to transmit, what do we do? This problem has arisen because we
failed to include the end-of-le character in our source alphabet. The best
solution to this problem is to use an arithmetic code as described in the next
chapter.
Solution to exercise 5.30 (p.103). The poisoned glass problem was intended to
have the solution 129, this being the only number of the form 2
m
+1 between
100 and 200. However the optimal strategy, assuming all glasses have equal
probability, is to design a Human code for the glasses. This produces a binary
tree in which each pair of branches have almost equal weight. On the rst
measurement, either 64 or 65 of the glasses are tested. (Given the assumption
that one of the glasses is poisoned, it makes no dierence which; however, going
for 65 might be viewed as preferable if there were any uncertainty over this
assumption.) There is a 2/129 probability that an extra test is needed after
seven tests have occurred. So the expected number of tests is 7
2
129
, whereas
the strategy of the professor takes 8 tests with probability 128/129 and one
test with probability 1/129, giving a mean number of tests 7
122
129
. The expected
waste is 40/43 tests.
Extra Solutions for Chapter 6
Solution to exercise 6.2 (p.117). Lets assume there are 128 viable ASCII char-
acters. Then the Human method has to start by communicating 128 integers,
each of which could in principle be as large as 127 or as small as 1, but plausi-
ble values will range from 2 to 17. There are correlations among these integers:
if one of them is equal to 1, then none of the others can be 1. For practical
purposes we might say that all the integers must be between 1 and 32 and use
a binary code to represent them in 5 bits each. Then the header will have a
size of 5 128 = 640bits.
If the le to be compressed is short 400 characters, say then (taking 4 as
a plausible entropy per character, if the frequencies are known) the compressed
length would be 640 (header) + 1600 (body) 2240, if the compression of
the body is optimal. For any le much shorter than this, the header is clearly
going to dominate the le length.
When we use the Laplace model, the probability distribution over charac-
ters starts out uniform and remains roughly so until roughly 128 characters
have been read from the source. In contrast, the Dirichlet model with = 0.01
only requires about 2 characters to be read from the source for its predictions
to be strongly swung in favour of those characters.
For sources that do use just a few characters with high probability, the
Dirichlet model will be better. If actually all characters are used with near-
equal probability then = 1 will do better.
The special case of a large le made entirely of equiprobable 0s and 1s is
interesting. The Human algorithm has to assign codewords to all the other
characters. It will assign one of the two used characters a codeword of length
1, and the other gets length 2. The expected lelength is thus more than
(3/2)N, where N is the source le length. The arithmetic codes will give an
expected lelength that asymptotically is N.
Solutions manual 709
It is also interesting to talk through the case where one character has
huge probability, say 0.995. Here, the arithmetic codes give a lelength thats
asymptotically less than N, and the Human method tends to N from above.
Solution to exercise 6.4 (p.119). Assume a code maps all strings onto strings
of the same length or shorter. Let L be the length of the shortest string that is
made shorter by this alleged code, and let that string be mapped to an output
string of length l. Take the set of all input strings of length less than or equal
to l, and count them. Lets say there are n
in
(l) of length l. [n
in
(l) = A
l
, where
A is the alphabet size.]
Now, how many output strings of length l do these strings generate? Well,
for any length < L, by the denition of L, all the input strings were mapped to
strings with the same length. So the total number of output strings of length
l must be at least n
in
(l) +1, since not only all the inputs of length l, but also
the shortest input string, dened above, maps to an output of length l.
By the pigeonhole principle, thats too many pigeons for the available holes.
Two of those output strings must be identical, so the mapping is not uniquely
decodeable.
Solution to exercise 6.7 (p.123). Figure D.2 shows the left-hand side of the
arithmetic encoder for the case N = 5, K = 2.
1
1
0
0
0
1
0
0
0
1
1
0
0
1
1
0
0
1
0
0
1
0
1
0
1
0
1
1
1
0
0
0
0
Figure D.2. Arithmetic encoder
for binary strings of length N = 5
with xed weight K = 2. (The
right-hand side, a regular binary
scale, has been omitted.)
Solution to exercise 6.9 (p.123). Using the Human algorithm we arrive at
the symbol code shown in the margin. The expected length is roughly 1. The
ai pi h(pi) li c(ai)
111 1e-6 19.9 5 00000
110 1e-4 13.3 5 00001
101 1e-4 13.3 5 00010
011 1e-4 13.3 5 00011
001 0.0098 6.7 3 001
010 0.0098 6.7 3 010
100 0.0098 6.7 3 011
000 0.97 0.0 1 1
entropy of x is 0.24. The ratio length / entropy is 4, to 1 decimal place.
An arithmetic code for a string of length N = 1000, neglecting the termi-
nation overhead, gives an expected length equal to N times the entropy, i.e.
80 bits.
The variance of the length is found from the variance of the number of 1s,
which is Npq; the length is linearly related to the number of 1s, r
l(r) = r log
1
f
1
+ (N r) log
1
f
0
= r log
f
0
f
1
+N log
1
f
0
, (D.17)
so the standard deviation is 3.14 log[f
0
/f
1
] = 21. So the compressed length
is expected to be 80 21 bits. Or at most two more than this, allowing for
worst-case termination.
710 Solutions manual
Solution to exercise 6.10 (p.124). One can generate strings with density f by
running dense random bits into the decoder corresponding to the arithmetic
encoder for a sparse source with density f. See gure D.3.
a
b
aa
ab
aaa
aab
aaaa
aaab
aaba
aabb
aba
abb
abaa
abab
ba
bb
baa
bab
baaa
baab
0
1
00
01
000
001
0000
0001
00000
00001
00010
00011
0010
0011
00100
00101
00110
00111
010
011
0100
0101
01000
01001
01010
01011
0110
0111
01100
01101
01110
01111
10
11
100
101
1000
1001
10000
10001
10010
10011
1010
1011
10100
10101
10110
10111
110
111
1100
1101
11000
11001
11010
11011
1110
1111
11100
11101
11110
11111
Figure D.3. Arithmetic encoder
for a sparse source with f = 1/6.
Solution to exercise 6.11 (p.124). The encoding is 001101011000000110100101000011,
coming from the following parsed message:
(, 0), (0, 1), (1, 0), (10, 1), (10, 0), (00, 0), (011, 0), (100, 1), (010, 0), (001, 1)
The highlighted symbols would be omitted in the further improved coding
system.
Extra Solutions for Chapter 6.8
Solution to exercise 6.15 (p.125). Using the Human coding algorithm, we
a
i
p
i
h(p
i
) l
i
c
i
a 0.01 6.6 6 000000
b 0.02 5.6 6 000001
c 0.04 4.6 5 00001
d 0.05 4.3 4 0010
e 0.06 4.1 4 0011
f 0.08 3.6 4 0001
g 0.09 3.5 3 100
h 0.1 3.3 3 101
i 0.25 2.0 2 11
j 0.3 1.7 2 01
arrive at the answer shown, which is unique (apart from trivial modications
to the codewords).
The expected length is 2.81. The entropy is 2.78.
Solution to exercise 6.16 (p.125). The entropy of y = x
1
x
2
is twice H(X);
H(X) = 1.295 bits so H(Y) = 2.59 bits.
Solutions manual 711
The optimal binary symbol code is constructed using the Human coding
algorithm. There are several valid answers; the codelengths should be iden-
tical to one of the two lists below. The strings ab and ba, marked , are
interchangeable.
a
i
p
i
h(p
i
) l
(1)
i
c(a
i
) l
(2)
i
c
(2)
(a
i
)
aa 0.01 6.6 6 000000 6 000000
ab
p
n
log(1/p
n
)/
p
n
l
n
subject to normalization. gives (dS/dp
n
LdL/dp
n
S)/L
2
= gives dS/dp
n
=
Rl
n
+L, with dS/dp
n
= 1 log p
n
. Thus p
n
= exp(Rl
n
)/Z.
Notice that this distribution has two properties: dlog Z/dR = L
S = log Z +RL
S/L = log Z/L +R
this instantly means log Z = 0 without my having to do any dierentiation!
Solution to exercise 6.19 (p.126). There are 52! orderings of a pack of cards, so
the minimum number of bits required to make a perfect shue is log
2
(52!)
226 bits.
Solution to exercise 6.20 (p.126). (Draft solution, more below.)
(a) After the cards have been dealt, the number of bits needed for North
to convey her hand to South (remember that he already knows his own
hand) is
log
2
_
39
13
_
33 bits. (D.18)
Now, North does not know Souths hand, so how, in practice, could this
information be conveyed eciently? [This relates to the Slepian-Wolf
correlated information comunication problem.]
(b) The maximum number of bits is equal to 35, the number of distinct
bids in the list 1. . . 7NT. Given the assumption that E and W do
not bid, the bidding process can be viewed as dening a binary string of
length 35, with a 1 against every bid that was made by N or S, and a 0
against every bid that was skipped. The complete bidding history can
be reconstructed from this binary string, since N and S alternate (we
assumed that the bidding stops if either of them does not bid). So the
maximum total information conveyed cannot exceed 35 bits.
A bidding system that achieved this maximum information content would
be one in which a binary code is agreed such that 0s and 1s are equally
712 Solutions manual
probable; then each bidder chooses the next bid by raising the bid by
the appropriate number of notches. There will be a probability of
1/
2
that they raise the bid by one notch; a probability of
1/
4 that they raise
it by two notches; and so forth.
Solution to exercise 6.20 (p.126). (a) From the point of view of South, the
information content of Norths hand is log
_
(5213)
13
_
33 bits.
(b) The list of bids not made and bids made forms a binary string. We
could write the omitted bids as 0s and the made bids as 1s. Then North and
South, by their raises, are dening a binary string of length 35. They can thus
convey a total of at most 35 bits to each other. Its conceivable therefore that
each could learn about half of the information content of the others hand (33
bits).
Solution to exercise 6.21 (p.126). (First of two solutions.)
(a) The arabic keypad can produce the times 0:010:09 in two symbols and
the times 0:100:59 in three symbols. The roman keypad can produce
the times 0:01, 0:10, 1:00, and 10:00 in two symbols, and 0:02, 0:11,
0:20, 1:01, 1:10, 2:00, 20:00, 11:00, 10:10, and 10:01 in three symbols.
The times 0:11, 1:01, 1:10, 11:00, 10:10, and 10:01 can all be produced
in two dierent ways, because the two keys with numbers can be pressed
in either sequence.
(b) The arabic code is incomplete because
i. The keys 0 and 2 are both illegal rst symbols.
ii. After a four-digit number has been entered, the only legal symbol
is 2.
The roman code is incomplete because
i. The key 2 is an illegal rst symbol.
ii. Some times can be produced by several symbol-strings. A time
such as 1:23 can be entered as CXXIII2 or as IICIXX2.
iii. After a key has been pressed a number of times (ve or nine, de-
pending which key) it may not be pressed any more times.
(c) The arabic code can produce 3:15, 2:30, and 5:00 in four symbols, and
the roman code cannot. The roman code can produce 12:00 and 21:00
in four symbols, and the arabic code cannot.
(d) Both codes allow the time 0:01 to be encoded with a very short sequence,
length two. This is implicitly one of the most probable times for both
models. In the arabic model, the implicit probability of a time is roughly
1/11
l+1
, where l is the length of the time when encoded in decimal. In
the roman model, times that contain many ones and zeroes are the most
probable, with the probability of a time decreasing roughly as the sum
of its digits: P(x) 1/5
s
, where s =
i
x
i
.
(e) When I use the microwave, my probability distribution is roughly:
x 0:10 0:20 0:30 1:00 1:10 1:20 1:30 2:00 3:00
P(x) 0.1 0.05 0.01 0.1 0.01 0.1 0.5 0.03 0.03
x 5:00 7:00 8:00 10:00 12:00 20:00 other
P(x) 0.02 0.02 0.02 0.01 0.01 0.01
Solutions manual 713
The arabic system is poorly matched to my distribution because it forces
me to push the zero button at the end of every time, to specify zero
seconds, which I always want. The roman system similarly wastes an
entire button (the I button) which I never use. The arabic system is
otherwise quite well matched to my probability distribution, except that
my favourite time (1:30 for a cafe latte) could do with a shorter sequence.
The roman system is wellmatched to some of my times, but terribly
matched to others, in particular, the time 8:00.
The arabic system has a maximum codelength of ve symbols. The
roman system has a terrible maximum codelength of 28 symbols, for the
time 59:59.
(f) An alternative encoder using ve buttons would work as follows.
i. The display starts by oering as a default the median of the last
one hundred times selected. If the user has a favourite time, then
this will probably be oered. It can be accepted by pressing the 2
key.
ii. The other four keys are marked +, ++, , and .
The + symbol increments the displayed time by a little bit,
e.g.16%.
The symbol decrements the displayed time by a little bit,
e.g.16%.
The ++ symbol increments the displayed time by a lot, e.g., a
factor of two.
The symbol decrements the displayed time by a lot, e.g., a
factor of two.
To make this system even more adaptive to its user, these four
buttons could have their eect by moving the percentile around the
distribution of times recently used by the user. The initial time is
the median. The + button takes us to the 63rd percentile and ++
takes us to the 87th, say, with the step size decreasing adaptively as
the time is selected. If a user has ve preferred times, these could
adaptively be discovered by the system so that, after time, the ve
times would be invoked by the sequences 2, 2, 2, + 2,
and ++ 2 respectively.
Solution to exercise 6.21 (p.126). (a) The arabic microwave can generate the
times 0, 1, 2, . . . 9 seconds with two symbols, and the times from 0 to 59
seconds (or perhaps 99 seconds) with three symbols. Not a very good use of
the shortest strings!
The roman microwave can generate 1 second, 10 seconds, 1 minute, and 10
minutes with two symbols, and any of the 10 sums of those 4 quantities with
three symbols. Again, not a good use of the shortest strings, except perhaps
1 minute and 10 minutes. Also there is a bit of ineciency: the sequences CX
and XC both produce 1 minute and 10 seconds.
(b) The codes are not complete. In a complete code, there would be a
unique way to encode each cooking time, and there would be no redundant
symbols (whereas all times in both codes end with Start, which is in at least
some cases redundant).
(c) 1 minute 23 seconds; 30 minutes.
(d) The implicit probability distribution over digits is uniform with arabic,
and the distribution over the number of non-zero digits is an exponential, with
714 Solutions manual
short strings (such as 8 seconds) being more probable than long ones (such as
10 minutes).
The implicit distribution over digits is highly nonuniform in the roman
microwave. The digit 1 is much more probable than the digit 9.
For most large times (such as twenty minutes) there are intermediate and
small times (such as twenty seconds, and two seconds) that have the same
implicit probability. So the roman distribution, while rather strange, does
have a scale-free property.
(e) My distribution over times has a lot of mass around 1:30, 2:00, and
4:00.
(f) No-one needs a ten-minute cooking event timed to within a second.
An ecient system would have variable grain size, with 1-second precision
at the ten-second end of things, 5-second precision in the 1-minute area, and
twenty-second precision at ten minutes.
Extra Solutions for Chapter 7
Solution to exercise 7.2 (p.134).
The self-delimiting representations of n are:
c
(n) = 0000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000
10000111101100110000111101011100100110010101000001010011110
100011000011101110110111011011111101110.
c
(n) = 000000
1100010
0000111101100110000111101011100100110010101000001010011110
100011000011101110110111011011111101110.
c
(n) = 00111
100010
0000111101100110000111101011100100110010101000001010011110
100011000011101110110111011011111101110.
c
(n) = 011
11
100010
0000111101100110000111101011100100110010101000001010011110
100011000011101110110111011011111101110.
Byte based, base 3.
010010100110101010011010010110000010101001010101010110001010010
110011000010000010010100101000010101010010100101001100001100011
base 7
011000101000101000100011100110010100001100000110011001101100000
100000100000010101000011001011100100010100111
Solutions manual 715
base 15
100111101010010000100110010000101011100011101011101000001110111
00011101110101010111001111101110100001111
Solution to exercise 7.3 (p.135). A code that has shorter codelengths asymp-
totically (e.g., for n > 2
100
) uses the same idea but rst encodes the number
of levels of recursion that the encoder will go through, using any convenient
prex code for integers, for example C
x,y
P(x, y) log
P(x)P(y)
P(x, y)
2
. (D.22)
is fairly easily shown to satisfy the rst three axioms D
H
(X, Y ) 0, D
H
(X, X) =
0, D
H
(X, Y ) =D
H
(Y, X).
A proof that it obeys the triangle inequality is not so immediate. It helps
to know in advance what the dierence D(X, Y ) +D(Y, Z) D(X, Z) should
add up to; this is most easily seen by rst making a picture in which the
quantities H(X), H(Y ), and H(Z) are represented by overlapping areas, c.f.
gure 8.2 and exercise 8.8 (p.141). Such a picture indicates that D(X, Y ) +
D(Y, Z) D(X, Z) = H(Y [X, Z) +I(X; Z[Y ).
D(X, Y ) +D(Y, Z) D(X, Z)
=
x,y,z
P(x, y, z) log
P(x)P(y)P(y)P(z)P(xz)
2
P(xy)
2
P(x)P(z)P(y, z)
2
(D.23)
= 2
x,y,z
P(x, y, z) log
P(x, z)P(x, z[y)
P(x, y, z)P(x[y)P(z[y)
(D.24)
= 2
x,y,z
P(x, y, z)
_
log
1
P(y[xz)
+ log
P(x, z[y)
P(x[y)P(z[y)
_
(D.25)
= 2
x,z
P(x, z)
y
P(y[x, z) log
1
P(y[x, z)
+ (D.26)
2
y
P(y)
x,z
P(x, z[y) log
P(x, z[y)
P(x[y)P(z[y)
(D.27)
= 2
x,z
P(x, z)H(Y [x, z) + 2
y
P(y)I(X; Z[y). (D.28)
= 2H(Y [X, Z) + 2I(X; Z[Y ). (D.29)
The quantity I(X; Z[Y ) is a conditional mutual information, which like a
mutual information is positive. The other term H(Y [X, Z) is also positive, so
D(X, Y ) +D(Y, Z) D(X, Z) 0.
716 Solutions manual
Solution to exercise 8.10 (p.142). Seeing the top of the card does convey in-
formation about the colour of its other side. Bayes theorem allows us to draw
the correct inference in any given case, and Shannons mutual information is
the measure of how much information is conveyed, on average.
This inference problem is equivalent to the three doors problem. One
quick way to justify the answer without writing down Bayes theorem is The
probability that the lower face is opposite in colour to the upper face is always
1/3, since only one of the three cards has two opposite colours on it.
The joint probability of the two colours is
P(u, l) u = 0 u = 1
l = 0
1/
3
1/
6
l = 1
1/
6
1/
3
(D.30)
The marginal entropies are H(U) = H(L) = 1 bit, and the mutual information
is
I(U; L) = 1 H
2
(
1/
3) = 0.08 bits. (D.31)
Extra Solutions for Chapter 9
Solution to exercise 9.17 (p.155). The conditional entropy of Y given X is
H(Y [X) = log 4. The entropy of Y is at most H(Y ) = log 10, which is
achieved by using a uniform input distribution. The capacity is therefore
C = max
P
X
H(Y ) H(Y [X) = log
10/
4 = log
5/
2 bits. (D.32)
Solutions manual 717
Solution to exercise 9.19 (p.156). The number of recognizable 2s is best es-
Figure D.4. Four random samples
from the set of 2
219
miniature 2s
dened in the text.
timated by concentrating on the type of patterns that make the greatest con-
tribution to this sum. These are patterns in which just a small patch of the
pixels make the shape of a 2 and most of the other pixels are set at random. It
is unlikely that the random pixels will take on some other recognizable shape,
as we will conrm later. A recognizable letter 2 surrounded by a white border
can be written in 67 pixels. This leaves 214 pixels that can be set arbitrarily,
and there are also 12 11 possible locations for the miniature 2 to be placed,
and two colourings (white on black / black on white). There are thus about
12 11 2 2
214
2
219
miniature 2 patterns, almost all of which are recog-
nizable only as the character 2. This claim that the noise pattern will not look
like some other character is conrmed by noting what a small fraction of all
possible patterns the above number of 2s is. Lets assume there are 127 other
characters to worry about. Only a fraction 2
37
of the 2
256
random patterns
are recognizable 2s, so similarly, of the 2
219
miniature 2 patterns identied
above, only a fraction of about 127 2
37
of them also contain another recog-
nizable character. These double-hits decrease undetectably the above answer,
2
219
.
Another way of estimating the entropy of a 2, this time banning the option
of including background noise, is to consider the number of decisions that are
made in the construction of a font. A font may be bold (2) or not bold;
italic (2) or not; sans-serif (2) or not. It may be normal size (2), small (2) or
tiny (2). It may be calligraphic, futuristic, modern, or gothic. Most of these
choices are independent. So we have at least 2
4
3
2
distinct fonts. I imagine
that Donald Knuths metafont, with the aid of which this document was
produced, could turn each of these axes of variation into a continuum so that
arbitrary intermediate fonts can also be created. If we can distinguish, say,
ve degrees of boldness, ten degrees of italicity, and so forth, then we can
imagine creating perhaps 10
6
2
20
distinguishable fonts, each with a distinct
2. Extra parameters such as loopiness and spikiness could further increase this
number. It would be interesting to know how many distinct 2s metafont can
actually produce in a 16 16 box.
The entropy of the probability distribution P(y [ x=2) depends on the
assumptions about noise and character size. If we assume that noise is unlikely,
then the entropy may be roughly equal to the number of bits to make a clean
2 as discussed above. The possibility of noise increases the entropy. The
largest it could plausibly be is the logarithm of the number derived above
for the number of patterns that are recognizable as a 2, though I suppose
one could argue that when someone writes a 2, they may end up producing
a pattern y that is not recognizable as a 2. So the entropy could be even
larger than 220 bits. It should be noted however, that if there is a 90% chance
that the 2 is a clean 2, with entropy 20 bits, and only a 10% chance that it
is a miniature 2 with noise, with entropy 220 bits, then the entropy of y is
H
2
(0.1) +0.1220+0.920 40 bits, so the entropy would be much smaller
than 220 bits.
Solution to exercise 9.21 (p.156). The probability of error is the probability
that the selected message is not uniquely decodeable by the receiver, i.e., it
is the probability that one or more of the S1 other people has the same
birthday as our selected person, which is
1
_
A1
A
_
S1
= 1 0.939 = 0.061. (D.33)
718 Solutions manual
The capacity of the communication channel is log 365 8.5 bits. The rate of
communication attempted is log 24 4.6 bits.
So we are transmitting substantially below the capacity of this noiseless
channel, and our communication scheme has an appreciable probability of
error (6%). Random coding looks a rather silly idea.
Solution to exercise 9.22 (p.157). The number of possible K-tuples is A
K
, and
we select q
K
such K-tuples, where q is the number of people in each of the K
rooms. The probability of error is the probability that the selected message is
not uniquely decodeable by the receiver,
1
_
A
K
1
A
K
_
q
K
1
. (D.34)
In the case q = 364 and K = 1 this probability of error is
1
_
1
1
A
_
q1
1 e
(q1)/A
1 e = 0.63. (D.35)
[The exact answer found from equation (D.34) is 0.631.] Thus random coding
is highly likely to lead to a communication failure.
As K gets large, however, we can approximate
1
_
A
K
1
A
K
_
q
K
1
= 1
_
1
1
A
K
_
q
K
1
q
K
1
A
K
_
q
A
_
K
. (D.36)
In the example q = 364 and A = 365, this probability of error decreases as
10
0.0012K
, so, for example, if K 6000 then the probability of error is smaller
than 10
6
.
For suciently large blocklength K, random coding becomes a reliable,
albeit bizarre, coding strategy.
Extra Solutions for Chapter 10
Solution to exercise 10.1 (p.168). Consider a string of bit pairs b
k
,
b
k
, having
the property that
K
k=1
P(
b
k
,= b
k
)/K = p
b
. These bits are concatenated in
blocks of size K = NR to dene the quantities s and s. Also, P(b
k
=1) =
1/2. We wish to show that these properties imply I(s; s) K(1 H
2
(p
b
)),
regardless of whether there are correlations among the bit errors.
More to come here.
Solution to exercise 10.12 (p.172). I(X; Y ) = H(X) H(X[Y ).
I(X; Y ) = H
2
(p
0
) qH
2
(p
0
).
Maximize over p
0
, get C = 1 q.
The (2, 1) code is 01, 10. With probability q, the 1 is lost, giving the out-
put 00, which is equivalent to the ? output of the Binary Erasure Channel.
With probability (1 q) there is no error; the two input words and the same
two output words are identied with the 0 and 1 of the BEC. The equivalent
BEC has erasure probability q. Now, this shows the capacity of the Z channel
is at least half that of the BEC.
This result is a bound, not an inequality, because our code constrains the
input distribution to be 50:50, which is not necessarily optimal, and because
weve introduced simple anticorrelations among successive bits, which optimal
codes for the channel would not do.
Solutions manual 719
Extra Solutions for Chapter 11
Solution to exercise 11.3 (p.184). In a nutshell, the encoding operations in-
volve additions and multiplies, and these operations are associative.
Let the source block be s
k2k1
and the transmitted block be t
n2n1
. Let
the two generator matrices be G
(1)
and G
(2)
. To conform to convention, these
matrices have to be transposed if they are to right-multiply.
If we encode horizontally rst, then the intermediate vector is
u
k2n1
=
k1
G
(1)T
n1k1
s
k2k1
(D.37)
and the transmission is
t
n2n1
=
k2
G
(2)T
n2k2
u
k2n1
(D.38)
=
k2
G
(2)T
n2k2
k1
G
(1)T
n1k1
s
k2k1
. (D.39)
Now, by the associative property of addition and multiplication, we can reorder
the summations and multiplications:
t
n2n1
=
k1
k2
G
(2)T
n2k2
G
(1)T
n1k1
s
k2k1
(D.40)
=
k1
G
(1)T
n1k1
k2
G
(2)T
n2k2
s
k2k1
. (D.41)
This is identical to what happens if we encode vertically rst, getting inter-
mediate vector
v
n2k1
=
k2
G
(2)T
n2k2
s
k2k1
(D.42)
then transmitting
t
n2n1
=
k1
G
(1)T
n1k1
v
n2k1
. (D.43)
Solution to exercise 11.6 (p.188). The fraction of all codes that are linear is
absolutely tiny. We can estimate the fraction by counting how many linear
codes there are and how many codes in total.
A linear (N, K) code can be dened by the M = N K constraints that
it satises. The constraints can be dened by a M N parity-check matrix.
Lets count how many parity-check matrices there are, then correct for over-
counting in a moment. There are 2
MN
distinct parity-check matrices. Most
of these have nearly full rank. If the rows of the matrix are rearranged, that
makes no dierence to the code. Indeed, you can multiply the matrix H
by any square invertible matrix, and there is no change to the code. Row-
permutation is a special case of multiplication by a square matrix. So the size
of the equivalence classes of parity-check matrix is 2
M
2
. (For every parity-
check matrix, there are 2
M
2
ways of expressing it.) So the number of dierent
linear codes is 2
MN
/2
M
2
= 2
MK
.
The total number of codes is the number of choices of 2
K
words from the
set of 2
N
possible words, which is
_
2
N
2
K
_
, which is approximately
(2
N
)
2
K
(2
K
)!
=
2
N2
K
(2
K
)!
. (D.44)
720 Solutions manual
The fraction required is thus
2
N
2
R(1R)
(2
K
)!
2
N2
K
. (D.45)
Solution to exercise 11.8 (p.188). A code over GF(8) We can denote the ele-
ments of GF(8) by 0, 1, A, B, C, D, E, F. Each element can be mapped onto
a polynomial over GF(2).
element polynomial binary representation
0 0 000
1 1 001
A x 010
B x + 1 011
C x
2
100
D x
2
+ 1 101
E x
2
+x 110
F x
2
+x + 1 111
(D.46)
The multiplication and addition operations are given by multiplication and
addition of the polynomials, modulo x
3
+x + 1.
Here is the multiplication table:
0 1 A B C D E F
0 0 0 0 0 0 0 0 0
1 0 1 A B C D E F
A 0 A C E B 1 F D
B 0 B E D F C 1 A
C 0 C B F E A D 1
D 0 D 1 C A F B E
E 0 E F 1 D B A C
F 0 F D A 1 E C B
(D.47)
Here is a (9,2) code over GF(8) generated by the generator matrix
G =
_
1 0 1 A B C D E F
0 1 1 1 1 1 1 1 1
_
(D.48)
000000000 011111111 0AAAAAAAA 0BBBBBBBB 0CCCCCCCC 0DDDDDDDD 0EEEEEEEE 0FFFFFFFF
101ABCDEF 110BADCFE 1AB01EFCD 1BA10FEDC 1CDEF01AB 1DCFE10BA 1EFCDAB01 1FEDCBA10
A0ACEB1FD A1BDFA0EC AA0EC1BDF AB1FD0ACE ACE0AFDB1 ADF1BECA0 AECA0DF1B AFDB1CE0A
B0BEDFC1A B1AFCED0B BA1CFDEB0 BB0DECFA1 BCFA1B0DE BDEB0A1CF BED0B1AFC BFC1A0BED
C0CBFEAD1 C1DAEFBC0 CAE1DC0FB CBF0CD1EA CC0FBAE1D CD1EABF0C CEAD10CBF CFBC01DAE
D0D1CAFBE D1C0DBEAF DAFBE0D1C DBEAF1C0D DC1D0EBFA DD0C1FAEB DEBFAC1D0 DFAEBD0C1
E0EF1DBAC E1FE0CABD EACDBF10E EBDCAE01F ECABD1FE0 EDBAC0EF1 EE01FBDCA EF10EACDB
F0FDA1ECB F1ECB0FDA FADF0BCE1 FBCE1ADF0 FCB1EDA0F FDA0FCB1E FE1BCF0AD FF0ADE1BC
Further exercises that can be based on this example:
Exercise D.1.
[2]
Is this code a perfect code?
Exercise D.2.
[2]
Is this code a maximum distance separable code?
Extra Solutions
Solution to exercise D.1 (p.720). The (9, 2) code has M = 7 parity checks,
and its distance is d = 8. If the code were perfect, then all points would be
at a distance of at most d/2 from the nearest codeword, and each point would
only have one nearest codeword.
Solutions manual 721
The (9, 2) code is not a perfect code. Any code with even distance cannot
be a perfect code because it must have vectors that are equidistant from the
two nearest codewords, for example, 000001111 is at Hamming distance 4
from both 000000000 and 011111111.
We can also nd words that are at a distance greater than d/2 from all
codewords, for example 111110000, which is at a distance of ve or more from
all codewords.
Solution to exercise D.2 (p.720). The (9, 2) code is maximum distance sepa-
rable. It has M = 7 parity checks, and when any 7 characters in a codeword
are erased we can restore the others. Proof: any two by two submatrix of G
is invertible.
Extra Solutions for Chapter 12
Solution to exercise 12.9 (p.201). log
36
6, 000, 000, 000 = 6.3, so a 7-character
address could suce, if we had no redundancy. One useful internet service
provided by shortURL.com is the service of turning huge long URLs into tiny
ones, using the above principle.
Email addresses can be as short as four characters (I know m@tc), but
roughly 15 is typical.
Extra Solutions for Chapter 13
Solution to exercise 13.6 (p.216). With (f) = 2f
1/2
(1 f)
1/2
, combining
(13.14) and (13.25), the average probability of error of all linear codes is
bounded by
P(block error))
w>0
A(w))[(f)]
w
w>0
2
N[H2(w/N)(1R)]
[(f)]
w
(D.49)
This is a sum of terms that either grow or shrink exponentially with N, depend-
ing whether the rst factor or the second dominates. We nd the dominant
term in the sum over w by dierentiating the exponent.
d
dw
N[H
2
(w/N) (1 R)] +wlog (f) = log
1 (w/N)
w/N
+ log (f) (D.50)
the maximum is at
w/N
1 (w/N)
= (f) (D.51)
i.e.,
w/N =
(f)
1 +(f)
=
1
1 + 1/(f)
. (D.52)
We require the exponent
N[H
2
(w/N) (1 R)] +wlog (f) (D.53)
to be negative at this point, then we can guarantee that the average error
probability vanishes as N increases. Plugging in the maximum-achieving
0
0.2
0.4
0.6
0.8
1
0 0.1 0.2 0.3 0.4 0.5
Poor man
Shannon
Figure D.5. Poor mans capacity
(D.55) compared with Shannons.
w/N, we have shown that the average error probability vanishes if
H
2
_
1
1 + 1/(f)
_
+
1
1 + 1/(f)
log (f) < (1 R), (D.54)
722 Solutions manual
and we have thus proved a coding theorem, showing that reliable communi-
cation can be achieved over the binary symmetric channel at rates up to at
least
R
poor man
= 1
_
H
2
_
1
1 + 1/(f)
_
+
1
1 + 1/(f)
log (f)
_
. (D.55)
Extra Solutions for Chapter 13
Solution to exercise 13.15 (p.221). All the Hamming codes have distance d =
3.
Solution to exercise 13.16 (p.221). A code has a word of weight 1 if an entire
column of the parity-check matrix is zero. There is a chance of 2
M
= 2
360
that all entries in a given column are zero. There are M = 360 columns. So
the expected value at w = 1 is
A(1) = M2
M
= 360 2
360
10
111
. (D.56)
Solution to exercise 13.17 (p.221). This (15,5) code is unexpectedly good:
While the Gilbert distance for a (15,5) code is 2.6, the minimum distance of
the code is 7. The code can correct all errors of weight 1, 2, or 3. The weight
enumerator function is (1,0,0,0,0,0,0,15,15,0,0,0,0,0,0,1).
Solution to exercise 13.18 (p.221). See gure D.6.
w A(w)
0 1
5 12
6 10
8 15
9 20
10 6
Total 64
0
5
10
15
20
0 1 2 3 4 5 6 7 8 9 10 15
1
10
0 1 2 3 4 5 6 7 8 9 10 15
Figure D.6. The weight
enumerator function of the
pentagonful code (solid lines).
The dotted lines show the average
weight enumerator function of all
random linear codes with the
same size of generator matrix.
The lower gure shows the same
functions on a log scale. While
the Gilbert distance is 2.2, the
minimum distance of the code is 5.
Solution to exercise 13.25 (p.223). Heres a suggested attack on this still-
open problem. [I use dual-containing as an abbreviation for having a self-
orthogonal dual.] Pick an ensemble of low-density parity-check codes for
example, dened by making an M N matrix in which every column is a
random vector of weight j. Each column involves
_
j
2
_
pairs of rows. There are
a total of N
_
j
2
_
such pairs. If the code is dual-containing, every such pair must
occur an even number of times, most probably twice.
Estimate the probability of every pairs occuring twice. Multiply this prob-
ability by the total number of codes in the ensemble to estimate the number
that are dual-containing.
Solution to exercise 13.26 (p.223). The formula for the error probability pro-
duced by a single codeword of weight d is
(
=
1
1 + 2
H2(f)1
. (D.58)
In the case f = 1/3, p
(t)
1
2
(b + (a +b)(b +d))
P(01) = a
(t)
1
2
(a + (a +b)
2
)
P(10) = d
(t)
1
2
(d + (d +b)
2
) (D.61)
P(11) = b
(t)
where the rst terms (
1
2
b,
1
2
a, etc.) come from the event that both bits inher-
ited from a single parent.
The mean tness of one locus in an ospring is then
p (a
(t) +d
(t)), (D.62)
and the total tness, which is the sum of G/2 such terms, has a binomial
distribution with parameters (N, p) = (G/2, p), i.e., mean = Np and vari-
ance
2
= Np(1 p). Approximating this distribution by a Gaussian, and
assuming truncation selection keeps the top half of the distribution, the mean
tness after truncation will be +
_
2/, and the fractions at one locus are
adjusted, by this selection, to:
a
(t) a
(t)
p
p
, d
(t) d
(t)
p
p
(D.63)
where
p
= p +
_
2/
1
_
G/2
_
p(1 p). (D.64)
The parents of the next generation thus have fractions given by a(t+1) = a
(t)
and d(t + 1) = d
(t).
add graphs here from gene/xortheory
Extra Solutions for Chapter 22
Solution to exercise 22.15 (p.309). The likelihood has N maxima: it is in-
nitely large if is set equal to any datapoint x
n
and
n
is decreased to zero,
the other
n
being left at non-zero values. Notice also that the datas mean
and median both give lousy answers to the question what is ?
Well discuss the straightforward Bayesian solution to this problem later.
726 Solutions manual
Extra Solutions for Chapter 29
Solution to exercise 29.1 (p.362). The solution in the book is incomplete, as
the expression for the variance of
r
w
r
(x
(r)
)
r
w
r
, (D.65)
where
w
r
P
(x
(r)
)
Q
(x
(r)
)
, (D.66)
is not given. We focus on the variance of the numerator. (The variance of the
ratio is messier.)
But rst, lets note the key insight here: what is the optimal Q(x) going
to look like? If (x) is a positive function, then the magic choice
Q(x) =
1
Z
Q
P
(x)(x) (D.67)
(if we could make it) has the perfect property that every numerator term will
evaulate to the same constant,
P
(x
(r)
)
Q
(x
(r)
)
(x
(r)
) =
P
(x
(r)
)Z
Q
P
(x
(r)
)(x
(r)
)
(x
(r)
) = Z
Q
, (D.68)
which is the required answer Z
Q
=
_
dxP
(x
(r)
)
Q
(x
(r)
)
=
Z
Q
(x
(r)
)
. (D.69)
Its intriguing to note that for this special choice of Q, where the numerator,
even for just a single random point, is exactly the required answer, so that
the best choice of denominator would be unity, the denominator created by
the standard method is not unity (in general). This niggle exposes a general
problem with importance sampling, which is that there are multiple possi-
ble expressions for the estimator, all of which are consistent asymptotically.
Annoying, hey? The main motivation for estimators that include the denom-
inator is so that the normalizing constants of the distributions P and Q do
not need to be known.
So, to the variance. The variance of a single term in the numerator is, for
normalized Q,
var
_
P
(x)
Q(x)
(x)
_
=
_
dx
_
P
(x)
Q(x)
(x)
_
2
Q(x)
2
=
_
dx
P
(x)
2
Q(x)
(x)
2
2
(D.70)
To minimize this variance with respect to Q, we can introduce a Lagrange
multiplier to enforce normalization. The functional derivative with respect
to Q(x) is then
(x)
2
Q(x)
2
(x)
2
, (D.71)
which is zero if
Q(x) P
(x)[(x)[. (D.72)
Solutions manual 727
Solution to exercise 29.14 (p.382). Freds proposals would be appropriate if
the target density P(x) were half as great on the two end states as on all other
states. If this were the target density, then the factor of two dierence in Q for
a transition in or out of an end state would be balanced by the factor of two
dierence in P, and the acceptance probability would be 1. Freds algorithm
therefore samples from the distribution
P
(x) =
_
_
_
1/
20 x 1, 2, . . . , 19
1/
40 x 0, 20
0 otherwise
. (D.73)
If Fred wished to retain the new proposal density, he would have to change
the acceptance rule such that transitions out of the end states would only be
accepted with probability 0.5.
Solution to exercise 29.19 (p.384). Typical samples dier in their value of
log P(x) by a standard deviation of order
N, lets say c
n
P(x
n
[)
P(x
n
)
(D.78)
The likelihood function contains a complete summary of what the experiment
tells us about . The log likelihood,
L() =
n
[x
n
[, (D.79)
is sketched in gure D.8.
The most probable values of are 0.92, and the posterior probability falls
by a factor of e
2
once we reach 0.1 and 3, so a range of plausible 0 values for
is (0.1, 3).
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
Figure D.8. Sketches of likelihood
function. Top: likelihood function
on a log scale. The gradient
changes by 2 as we pass each data
point. Gradients are 4, 2, 0, 2,
4. Bottom: likelihood function
on a linear scale. The exponential
functions have lengthscales
1/
4,
1/
2,
1/
2,
1/
4.
Solutions manual 729
Extra Solutions for Chapter 36
Solution to exercise 36.5 (p.454).
Preference of A to B means
u(1) > .89u(1) +.10u(2.5) +.01u(0) (D.80)
Whereas preference of D to C means
.89u(0) +.11u(1) < .90u(0) +.10u(2.5) (D.81)
.11u(1) < .01u(0) +.10u(2.5) (D.82)
u(1) < .89u(1) +.10u(2.5) +.01u(0) (D.83)
which contradicts (D.80).
Solution to exercise 36.9 (p.456). The probability of winning either of the
rst two bets is 6/11 = 0.54545. The probability that you win the third bet
is 0.4544. Joe simply needs to make the third bet with a stake that is bigger
than the sum of the rst two stakes to have a positive expectation on the
sequence of three bets.
The Las Vegas trickster
Solution to exercise 36.9 (p.456). On a single throw of the two dice, let the
outcomes 6 and 7 have probabilities P(6) = p
6
and P(7) = p
7
. Note P(8) = p
6
.
The values are p
6
= 5/36 and p
7
= 6/36 = 1/6. For the rst bet, we can
ignore other outcomes apart from the winning and losing outcomes 7 and 6
and compute the probability that the outcome is a 7, given that the game has
terminated,
p
7
p
6
+p
7
= 6/11 = 0.54545. (D.84)
The second bet is identical. Both are favourable bets.
The third bet is the interesting one, because it is not a favourable bet for
you, even though it sounds similar to the two bets that have gone before. The
essential intuition for why two sevens are less probable than an 8 and a 6 is
that the 8 and the 6 can come in either of two orders, so a rough factor of two
appears in the probability for 8 and 6.
Computing the probability of winning is quite tricky if a neat route is not
found. The probability is most easily computed if, as above, we discard all the
irrelevant events and just compute the conditional probability of the dierent
ways in which the state of the game can advance by one step. The possible
paths taken by this pruned game with their probabilities are shown in the
gure as a Markov process. (The original unpruned game is a similar Markov
process in which an extra path emerges from each node, giving a transition
back to the same node.) The node labelled A denotes the initial state in
which no 6s, 7s or 8s have been thrown. From here transitions are possible
to state 7 in which exactly one 7 has been thrown, and no 6s or 8s; and to
state E, in which either [one or more 8s have occurred and no 6s or 7s] or
[one or more 6s have occurred and no 6s or 7s]. The probabilities of these
transitions are shown. We can progress from state E only if Joes winning 6
or 8 (whichever it is) is thrown, or if a 7 occurs. These events take us to the
states labelled 68 and E7 respectively. From state 7 the game advances
when a 6 or 8 is thrown, or when a 7 is thrown, taking us to states E7 and
77 respectively. Finally, from state E7, if a 7 is thrown we transfer to state
730 Solutions manual
A
`
_
7
`
_
E
`
_
68
`
_
E7
`
_
77
`
_
678
`
_
E77
`
_ p7
p6+p7+p8
p7
p6+p7+p8
p6+p8
p6+p7+p8
p6
p6+p7
p6+p8
p6+p7+p8
p7
p6+p7
p6
p6+p7
p7
p6+p7
3
Q
Q
Q
Qs
3
Q
Q
Q
Qs
3
Q
Q
Q
Qs
3
Q
Q
Q
Qs
A
`
_
7
`
_
E
`
_
68
`
_
E7
`
_
77
`
_
678
`
_
E77
`
_
6
16
6
16
10
16
5
11
10
16
6
11
5
11
6
11
3
Q
Q
Q
Qs
3
Q
Q
Q
Qs
3
Q
Q
Q
Qs
3
Q
Q
Q
Qs
Figure D.9. Markov process
describing the Las Vegas dice
game, pruned of all irrelevant
outcomes. The end states 68 and
678 are wins for Joe. States E77
and 77 are wins for you. Please do
not confuse this state diagram, in
which arrows indicate which
states can follow from each other,
with a graphical model, in which
each node represents a dierent
variables and arrows indicate
causal relationships between
them.
E77, and if Joes required 6 or 8 is thrown, we move to state 678. States 68
and 678 are wins for Joe; states 77 and E77 are wins for you.
We rst need the probability of state E7,
(10/16)(6/11) + (6/16)(10/16) = 405/704 = 0.5753 (D.85)
The probability that you win is
P(77) +P(E77) = (6/16)
2
+P(E7)(6/11) = 3519/7744 = 0.4544 (D.86)
The bet is not favourable. Notice that Joe simply needs to make the third
bet with a stake that is bigger than the sum of the rst two stakes to have a
positive expectation on the sequence of three bets.
Extra Solutions for Chapter 39
Solution to exercise 39.1 (p.472). One answer, given in the text on page 472,
is that the single neuron function was encountered under the best detection
of pulses. The same function has also appeared in the chapter on variational
methods when we derived mean eld theory for a spin system. Several of the
solutions to the inference problems in chapter 1 were also written in terms of
this function.
Solution to exercise 39.5 (p.480). If we let x and s be binary 1
7
, the
likelihood is (1 f)
N
f
M
, where N = (s
T
x +7)/2 and M = (7 s
T
x)/2. From
here, it is straightforward to obtain the log posterior probability ratio, which
is the activation.
The LED displays a binary code of length 7 with 10 codewords. Some
codewords are very confusable 8 and 9 dier by just one bit, for example.
A superior binary code of length 7 is the (7, 4) Hamming code. This code has
15 non-zero codewords, all separated by a distance of at least 3 bits.
Here are those 15 codewords, along with a suggested mapping to the inte-
gers 014.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Solution to exercise 39.6 (p.480).
log
P(s = 1[r)
P(s = 2[r)
= log
P(r[s = 1)P(s = 1)
P(r[s = 2)P(s = 2)
= log
_
1 f
f
_
2r11
+ log
_
1 f
f
_
(2r31)
+ log
P(s = 1)
P(s = 2)
= w
1
r
1
+w
3
r
3
+w
0
,
Solutions manual 731
where
w
1
= 2 log
_
1 f
f
_
, w
3
= 2 log
_
1 f
f
_
, w
0
= log
P(s = 1)
P(s = 2)
, (D.87)
and w
2
= 0, which we can rearrange to give
P(s = 1 [ r) =
1
1 + exp
_
w
0
3
n=1
w
n
r
n
_.
This can be viewed as a neuron with two or three inputs, one from r
1
with a
A
A
A
A
A
, ,
,
w
0 `
_
6
P(s = 1 | r)
r
1
r
3
w
1
w
3
positive weight, and one from r
3
with a negative weight, and a bias.
Extra Solutions for Chapter 40
Solution to exercise 40.6 (p.490).
(a) w = (1, 1, 1).
(b) w = (1/4, 1/4, 1).
The two unrealizable labellings are 0, 0, 0, 1 and 1, 1, 1, 0.
Solution to exercise 40.8 (p.490). With just a little compression of the raw
data, its possible your brain could memorize everything.
Extra Solutions for Chapter 41
Solution to exercise 41.2 (p.502). When w Normal(w
MP
, A
1
), the scalar
a = a(x; w
MP
) + (w w
MP
) x is Gaussian distributed with mean a(x; w
MP
)
and variance s
2
=x
T
A
1
x.
This is easily shown by simply computing the mean and variance of a,
then arguing that as distribution must be Gaussian, because the marginals of
a multivariate Gaussian are Gaussian. (See page 176 for a recap of multivariate
Gaussians.) The mean is
a) = a(x; w
MP
) + (ww
MP
) x) = a(x; w
MP
)+(ww
MP
))x = a(x; w
MP
).
The variance is
(a a(x; w
MP
))
2
_
= x (ww
MP
)(ww
MP
) x) (D.88)
= x
T
(ww
MP
)(ww
MP
)
T
) x = x
T
A
1
x.
Solution to exercise 41.3 (p.503). In the case of a single data point, the like-
lihood function, as a function of one parameter w
i
, is a sigmoid function; an
example of a sigmoid function is shown on a logarithmic scale in gure D.11a.
The same gure shows a Gaussian distribution on a log scale. The prior dis-
tribution in this problem is assumed to be Gaussian; and the approximation
Q is also a Gaussian, tted at the maximum of the sum of the log likelihood
and the log prior.
The log likelihood and log prior are both concave functions, so the curva-
ture of log Q must necessarily be greater than the curvature of the log prior.
But asymptotically the log likelihood function is linear, so the curvature of the
log posterior for large [a[ decreases to the curvature of the log prior. Thus for
suciently large values of w
i
, the approximating distribution is lighter-tailed
than the true posterior.
732 Solutions manual
(a)
0.001
0.01
0.1
1
-6 -4 -2 0 2 4 6
f(a)
g(a)
(b)
0.0001
0.001
0.01
0.1
1
-6 -4 -2 0 2 4 6
f(a)g(a)
Q
Figure D.11. (a) The log of the
sigmoid function
f(a) = 1/(1 + e
a
) and the log of
a Gaussian g(a) Normal(0, 4
2
).
(b) The product P = f(a)g(a)
and a Gaussian approximation to
it, tted at its mode. Notice that
for a range of negative values of a,
the Gaussian approximation Q is
bigger than P, while for values of
a to the right of the mode, Q is
smaller than P.
This conclusion may be a little misleading however. If we multiply the
likelihood and the prior and nd the maximum and t a Gaussian there, we
might obtain a picture like gure D.11b. Here issues of normalization have
been ignored. The important point to note is that since the Gaussian is
tted at a point where the log likelihoods curvature is not very great, the
approximating Gaussians curvature is too small for a between a
MP
and a
MP
,
with the consequence that the approximation Q is substantially larger than
P for a wide range of negative values of a. On the other hand, for values of a
greater than a
MP
, the approximation Q is smaller in value than P.
Thus whether Q is for practical purposes a heavy-tailed or light-tailed
approximation to P depends which direction one looks in, and how far one
looks.
The Gaussian approximation becomes most accurate when the amount of
data increases, because the log of the posterior is a sum of more and more bent
functions all of which contribute curvature to the log posterior, making it more
and more Gaussian (c.f. gure 41.1). The greatest curvature is contributed
by data points that are close (in terms of a) to the decision boundary, so the
Gaussian approximation becomes good fastest if the optimized parameters are
such that all the points are close to the decision boundary, that is, if the data
are noisy.