Probability and Coding
J. Andres Montoya
May 21, 2020
1 Probabilistic Inference: Bayes Theorem
Let $X, Y_1, \ldots, Y_n$ be random variables that are jointly distributed according to a probability mass function $P(x, y_1, \ldots, y_n)$. Suppose one knows that the equalities $Y_1 = a_1, \ldots, Y_n = a_n$ hold, and suppose that he wants to say something about the possible value of $X$. The most likely value of $X$ is the element $a$ for which the quantity $\Pr[X = a \mid Y_1 = a_1, \ldots, Y_n = a_n]$ is maximized.
If one wants to use this code for transmission, then, given a message $(x_1, \ldots, x_4)$, he has to compute three parity-check bits, the bits $x_1 + x_2 + x_3$, $x_1 + x_2 + x_4$ and $x_1 + x_3 + x_4$, and construct the codeword
$$(x_1, \ldots, x_4,\; x_1 + x_2 + x_3,\; x_1 + x_2 + x_4,\; x_1 + x_3 + x_4).$$
This is the codeword that is transmitted using the insecure channel at hand.
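To make the encoding rule concrete, here is a minimal sketch in Python (the function name and the sample message are ours, not part of the text) that computes the three parity-check bits and assembles the 7-bit codeword described above.

def encode(message):
    """Encode a 4-bit message (x1, x2, x3, x4) into the 7-bit codeword
    (x1, x2, x3, x4, x1+x2+x3, x1+x2+x4, x1+x3+x4), with sums taken mod 2."""
    x1, x2, x3, x4 = message
    p1 = (x1 + x2 + x3) % 2
    p2 = (x1 + x2 + x4) % 2
    p3 = (x1 + x3 + x4) % 2
    return [x1, x2, x3, x4, p1, p2, p3]

if __name__ == "__main__":
    print(encode([1, 0, 1, 1]))  # one of the 16 codewords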
Suppose that this channel is a symmetric binary channel with noise $f \in (0, \tfrac{1}{2})$, that is, $f$ is the probability of a bit flip. Suppose also that Alice wants to transmit a message $w \in \{0,1\}^4$, and suppose that Bob is aware of Alice's desire.
Notice that, from the point of view of Bob, there are two random variables that matter: the variable $X$, whose value is Alice's choice of a message, and the variable $Y$, whose value is the string of length 7 that will be seen by Bob after transmission. Thus, transmission will completely reveal the value of $Y$, but it won't completely reveal the value of $X$, because of the noise. The best thing that Bob can do is to determine the most likely choice of Alice, given the known value of $Y$. Can it be done? Notice that the decoding question is a typical inference task that must be solved probabilistically.
$$\Pr(Y_i = y_i \mid H_i = h_i) = \begin{cases} f, & \text{if } y_i \neq h_i,\\ 1-f, & \text{otherwise.} \end{cases}$$
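For a whole block, the displayed transition probabilities give the likelihood $\Pr(Y = y \mid H = h) = f^{d}(1-f)^{N-d}$, where $d$ is the number of positions in which $y$ and $h$ differ. A small sketch of this computation (our own illustration, not taken from the text):

def likelihood(received, transmitted, f):
    """Probability of observing `received` given that `transmitted` was sent
    over a symmetric binary channel with bit-flip probability f:
    each flipped bit contributes a factor f, each preserved bit a factor 1 - f."""
    assert len(received) == len(transmitted)
    d = sum(r != t for r, t in zip(received, transmitted))  # Hamming distance
    n = len(received)
    return (f ** d) * ((1 - f) ** (n - d))

if __name__ == "__main__":
    sent = [1, 0, 1, 1, 0, 0, 1]
    seen = [1, 0, 0, 1, 0, 0, 1]          # one bit was flipped
    print(likelihood(seen, sent, 0.1))    # 0.1 * 0.9**6 = 0.0531...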
Theorem 3 (Bayes' Theorem)
Let $X, Y$ be two random variables that are jointly distributed according to a probability distribution $P(x, y)$; we have
$$\Pr(X = x \mid Y = y) = \frac{\Pr(Y = y \mid X = x)\,\Pr(X = x)}{\Pr(Y = y)}.$$
It is very easy to prove the above theorem (despite its strength). We leave the proof as an exercise for the reader.

Exercise 4 Prove Bayes' Theorem, and use it to compute the above backward probabilities.
Bayes' Theorem is named after Thomas Bayes (1701–1761), who studied how to compute a distribution for the probability parameter of a binomial distribution. Bayes' unpublished manuscript was significantly edited by Richard Price before it was posthumously read at the Royal Society [1]. Price edited Bayes' major work "An Essay towards solving a Problem in the Doctrine of Chances", which appeared in Philosophical Transactions, and contains Bayes' Theorem. Price wrote an introduction to the paper which provides some of the philosophical basis of Bayesian statistics. In 1765, he was elected a Fellow of the Royal Society in recognition of his work on the legacy of Bayes. By modern standards, we should refer to the Bayes–Price rule. Price discovered Bayes' work, recognized its importance, corrected it, contributed to the article, and found a use for it.
Suppose one proposes a model $H$ (hypothesis). If $H$ is a predictive model, it yields a method for the computation of the forward probabilities $P(E \mid H)$. The existence of a method for the computation of the backward probabilities is less clear. Bayes' Theorem ensures that if the former probability can be computed, then the latter probability can also be computed. Bayes' Update Rule (Bayes' Theorem) states that
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)},$$
and it is widely used in probabilistic inference to update the probability of a hypothesis as new information is collected. Let us consider an elementary example of this application.
Suppose there are two full bowls of cookies. Bowl 1 has 10 chocolate chip and 30 plain cookies, while bowl 2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl 1? Here we have a hypothesis: Fred chose bowl 1. And we also have data: the chosen cookie is plain. Now, we want to evaluate the probability of the hypothesis given the data. It seems clear that the answer should be more than a half, since there are more plain cookies in bowl 1. The precise answer is given by Bayes' theorem:
Let $H_i$ be the hypothesis: the chosen bowl is bowl number $i$. We assumed that the bowls are identical from Fred's point of view, which implies that
$$P(H_1) = P(H_2) = \frac{1}{2}.$$
Both hypotheses are predictive: they allow us to compute the forward probabilities (the probability of picking each of the two possible types of cookies). If $E$ is the event "the chosen cookie is plain", we get that
$$\Pr(E \mid H_1) = \frac{3}{4} \quad \text{and} \quad \Pr(E \mid H_2) = \frac{1}{2}.$$
Notice that $P(E)$ is a marginal probability: the probability of event $E$ given null information about the additional variables. This marginal probability can be hard to compute. However, this is not the case in our elementary example. We have that
$$P(E) = P(H_1)\Pr(E \mid H_1) + P(H_2)\Pr(E \mid H_2) = \frac{3}{8} + \frac{2}{8} = \frac{5}{8}.$$
We get from Bayes' rule that
$$\Pr(H_1 \mid E) = \frac{\Pr(E \mid H_1)\,P(H_1)}{P(E)} = \frac{3}{5}.$$
We can conclude that $H_1$ is more likely than $H_2$ (given the data), since
$$\Pr(H_2 \mid E) = \frac{2}{5} < \frac{3}{5} = \Pr(H_1 \mid E).$$
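The computation above is easy to reproduce. The following sketch (the variable names are ours) applies Bayes' rule to the two bowls and prints the posterior $3/5$.

# Priors: Fred picks each bowl with the same probability.
prior = {"bowl1": 0.5, "bowl2": 0.5}
# Forward probabilities: probability of drawing a plain cookie from each bowl.
likelihood_plain = {"bowl1": 30 / 40, "bowl2": 20 / 40}

# Marginal probability of the data (a plain cookie).
p_plain = sum(prior[h] * likelihood_plain[h] for h in prior)

# Posterior probabilities via Bayes' rule.
posterior = {h: prior[h] * likelihood_plain[h] / p_plain for h in prior}
print(posterior)  # {'bowl1': 0.6, 'bowl2': 0.4}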
1.3 Decoding

Let $v, w \in \mathbb{F}_q^N$. The Hamming distance between those two strings is equal to the number of positions at which they differ, that is, $d_H(v, w) = |\{i \leq N : v[i] \neq w[i]\}|$.
Suppose that $w$ is the vector received by Bob after transmission, and let $v$ be the codeword that minimizes the Hamming distance to $w$. The codeword $v$ is the most likely message. Thus, the above problem, or some relevant subproblem, is the problem that we have to solve if we want to decode the messages transmitted by Alice. We have to remark that this problem is NP-hard [3]. The latter means that we cannot efficiently process all the instances of this problem. We notice that this is not the end of the story, given that we do not want to decode all the possible codes; we want to decode the codes that matter. We have to ask: what are the codes that matter? The latter is a hard question which lacks a definitive answer; it is one of the foundational questions of Information Theory.
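For a code as small as the $[7,4]$ code above, the most likely codeword can be found by brute force: enumerate the 16 codewords and pick the one closest to the received word in Hamming distance. The sketch below is our own illustration (the encoder is the one from the previous section); it is exactly the kind of exhaustive search that stops being feasible for large codes.

from itertools import product

def encode(message):
    """The (7,4) parity-check encoder discussed above."""
    x1, x2, x3, x4 = message
    return [x1, x2, x3, x4,
            (x1 + x2 + x3) % 2, (x1 + x2 + x4) % 2, (x1 + x3 + x4) % 2]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def decode(received):
    """Return the codeword at minimum Hamming distance from `received`
    (the most likely transmitted word over a symmetric binary channel with f < 1/2)."""
    codewords = [encode(list(m)) for m in product([0, 1], repeat=4)]
    return min(codewords, key=lambda c: hamming(c, received))

if __name__ == "__main__":
    sent = encode([1, 0, 1, 1])
    corrupted = sent.copy()
    corrupted[2] ^= 1                 # flip one bit
    print(decode(corrupted) == sent)  # True: a single error is corrected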
References
[1] T. Bayes. An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London 53: 370–418, 1763.

[2] W. Kurt. Bayesian Statistics the Fun Way: Understanding Statistics and Probability with Star Wars, LEGO, and Rubber Ducks. No Starch Press, San Francisco, 2019.

[3] E. Berlekamp, R. McEliece, and H. van Tilborg. On the inherent intractability of certain coding problems. IEEE Trans. Inform. Theory, 24: 384–386, 1978.
2 Uncertainty and Entropy

In the previous lecture we entered the world of information theory. We will stay in this world for a while. We want to know which linear codes are worth decoding, and we want to know whether we can decode those codes efficiently. To answer the former question we will need the fundamental notion of entropy.
Definition 6 Let $X$ be a random variable distributed over the set $\{1, \ldots, n\}$ with probabilities $\{p_1, \ldots, p_n\}$. The Shannon information content of the event $i$ is equal to $\log \frac{1}{p_i}$.

We observe that the above definition makes sense: the information content of the event $i$ is directly proportional to the surprise that its occurrence could produce.
Definition 7 Let $X_1, \ldots, X_n$ be a tuple of random variables, which are jointly distributed over a finite set $A_1 \times \cdots \times A_n$. The (joint) entropy of the tuple $(X_1, \ldots, X_n)$ is equal to
$$H(X_1, \ldots, X_n) = -\sum_{(a_1, \ldots, a_n)} \Pr[X_1 = a_1, \ldots, X_n = a_n] \log\left(\Pr[X_1 = a_1, \ldots, X_n = a_n]\right).$$
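As a quick illustration of these definitions (a sketch of ours, not part of the text), the following snippet computes the Shannon information content of an outcome and the entropy of a distribution; for the joint entropy one applies the same function to the joint probability mass function.

import math

def information_content(p):
    """Shannon information content, in bits, of an event of probability p."""
    return math.log2(1 / p)

def entropy(probabilities):
    """Entropy (in bits) of a distribution given as a list of probabilities.
    Terms with probability 0 contribute nothing, by the convention 0 log(1/0) = 0."""
    return sum(p * information_content(p) for p in probabilities if p > 0)

if __name__ == "__main__":
    print(information_content(1 / 8))          # 3.0 bits
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits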
Definition 15 An injective function $c : \{1, \ldots, n\} \to \{0,1\}^*$ is a prefix-free code, if and only if, there do not exist $i \neq j \in \{1, \ldots, n\}$ such that $c(i)$ is a prefix of $c(j)$.

An optimal code for the distribution $\{p_1, \ldots, p_n\}$ is a code that minimizes the average codeword length $\sum_{i \leq n} p_i\,|c(i)|$; as we will see, this minimum is governed by $H(p_1, \ldots, p_n)$, where $H(p_1, \ldots, p_n)$ is the Shannon entropy of this distribution (random variable).
Proof. Let $L \geq l_1, \ldots, l_n$. There are exactly $2^L$ words of length $L$. Let us say that each of those words has weight $2^{-L}$. We have that $\sum_{|w| = L} \mathrm{weight}(w) = 1$. Let $w_i, w_j$ be two code-words of lengths $l_i$ and $l_j$. We have that those two words do not have common heirs, and we have that $w_i$ has $2^{L - l_i}$ heirs of length $L$. Then, we have
$$1 \geq \sum_{w \text{ is heir of a code-word}} \mathrm{weight}(w) = \sum_{i \leq n} 2^{L - l_i} \cdot 2^{-L} = \sum_{i \leq n} 2^{-l_i}.$$
Exercise 19 Let $\{w_1, \ldots, w_n\}$ be a binary prefix code; prove that the average length of this code is bounded below by $H(p_1, \ldots, p_n)$.

Theorem 20 There exists a prefix code for $\Sigma_n$ whose average length is bounded above by $H(p_1, \ldots, p_n) + 1$.

The proof of the above theorem allows us to give the term $\log\frac{1}{p_i}$ an important and useful interpretation: $\log\frac{1}{p_i}$ is equal to the number of bits that are used to encode $i$ in an optimal encoding of $\Sigma_n$. The entropy of this probability distribution is the average length of this optimal encoding.
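The usual way to prove Theorem 20 is to assign to symbol $i$ a codeword of length $\lceil \log \frac{1}{p_i} \rceil$: these lengths satisfy Kraft's inequality, so a prefix-free code with those lengths exists, and its average length lies below $H(p_1, \ldots, p_n) + 1$. The following sketch (ours, under the assumption that this is the construction the text has in mind) checks both facts numerically for an arbitrary example distribution.

import math

def shannon_lengths(probabilities):
    """Codeword lengths ceil(log2(1/p_i)) used in the proof of the upper bound."""
    return [math.ceil(math.log2(1 / p)) for p in probabilities]

def entropy(probabilities):
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

if __name__ == "__main__":
    dist = [0.45, 0.25, 0.15, 0.10, 0.05]
    lengths = shannon_lengths(dist)
    kraft = sum(2 ** (-l) for l in lengths)
    avg = sum(p * l for p, l in zip(dist, lengths))
    h = entropy(dist)
    print(lengths)             # [2, 2, 3, 4, 5]
    print(kraft <= 1)          # Kraft's inequality holds, so a prefix code exists
    print(h <= avg <= h + 1)   # the average length is between H and H + 1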
3 Data Compression and Shannon's Source Coding Theorem

Let us go back to Alice's Gran. Recall that she went to the post office in order to send a telegram. There, in the post office, she met the operator of the telegraph, a lovely young guy who wants to be as helpful as possible.

The operator observed that each character has a characteristic frequency within the text. Let us use the symbol $p_i$ to denote the frequency of character $i$. The set $\{p_1, \ldots, p_n\}$ is a probability distribution over $\Sigma_n$, and it is the probability distribution that matters for this message. The operator decides to try an efficient encoding of Gran's message. Given the empirical distribution $\{p_1, \ldots, p_n\}$, the operator can look for a prefix-free code of the alphabet $\Sigma_n$. It is for sure that he can find a prefix-free code $c$ such that the average length of $c$ is bounded above by $H(p_1, \ldots, p_n) + 1$. Let $w$ be the message written by Alice's Gran; if the operator uses $c$ he gets a codeword $c_w$ for which the inequality $|c_w| \leq (H(p_1, \ldots, p_n) + 1)\,|w|$ holds.
2. The above coding scheme does not work in real time. Coding with this scheme requires two passes over the string $w$: in the first pass one computes the distribution $\{p_1, \ldots, p_n\}$, and in the second pass he computes the codeword $c_w$. We ask: does there exist a real-time coding scheme achieving a similar upper bound?
Exercise 22 Prove that the compression ratio of the code $c$, discussed in the previous section, is bounded above by $H(p_1, \ldots, p_n) + 1$.
Exercise 23 Prove the first item.

We focus on the second item, which has an existential (constructive) character. We have to ask: what does it mean that the ratio $H(p_1, \ldots, p_n)$ can be asymptotically achieved?

We can imagine that Alice's Gran is not the author of a single message, but an information source with distribution $\{p_1, \ldots, p_n\}$. The latter means that: if Alice's Gran is in message mode, she chooses at random a length $m$ and a message $u \in \Sigma_n^m$. The string $u$ is chosen character by character: the $i$-th character of $u$ is chosen from $\Sigma_n$ according to the distribution $\{p_1, \ldots, p_n\}$.
Let $w$ be the original message written by Alice's Gran, and which motivated all this discussion. We have that $w$ is just one of the infinitely many messages that can be written by this lovely old woman. Thus, we can try to construct a suitable code for this information source and not only for $w$. If we do the latter we will have to code arbitrarily long strings, and hence it will become possible to asymptotically achieve Shannon's Limit: the information ratio achieved for long strings approaches Shannon's Limit.
Remark 24 We would like to remark that $w$ is a typical string for the above distribution: the empirical frequency of character $i$, which is defined as $\frac{|\{k \leq |w| : w[k] = i\}|}{|w|}$, is exactly equal to $p_i$.
We prove, in the next section, that it is possible to asymptotically achieve the ratio $H(p_1, \ldots, p_n)$ using block codes of increasing length.

Definition 25 A block code of length $N$ is a prefix-free embedding of $\Sigma_n^N$ into $\{0,1\}^*$.
Remark 26 Suppose $M > N$. One can use a block code of length $N$ to code strings of length $M$. Suppose $w \in \Sigma_n^M$; one can factor this string as a sequence of blocks of length $N$ plus a single factor (the rightmost factor) whose length is upper-bounded by $N$. Suppose that $w$ is factored as $u_1 \cdots u_k u_{k+1}$, and suppose that $|u_{k+1}| < N$. The codeword assigned to $w$ is the string
$$\hat{c}(w) = c(u_1) \cdots c(u_k)\, \tilde{c}(u_{k+1}),$$
where $\tilde{c}$ is a second code that is applied to the strings whose length is strictly less than $N$. We are not interested in the code $\tilde{c}$, and we will focus on the behavior of $c$.
Definition 27 Suppose we have an information source with distribution $\{p_1, \ldots, p_n\}$, and let $c : \Sigma_n^N \to \{0,1\}^*$ be a binary block code for this source.

1. We say that $c$ is a lossless code, if and only if, given $w \in \Sigma_n^N$, one can deterministically recover $w$ from its code $c(w)$.

2. Let $\varepsilon > 0$. We say that $c$ is $\varepsilon$-lossy, if and only if, the inequality
$$\Pr_{w \sim D_N}[\text{one can correctly compute } w \text{ from } c(w)] \geq 1 - \varepsilon$$
holds, where $D_N$ is the probability distribution over $\Sigma_n^N$ that is determined by the information source under consideration.
Let $S$ be an information source with probability distribution $\{p_1, \ldots, p_n\}$. Shannon's probabilistic argument allows us to show that for all $\varepsilon > 0$ there exists an $\varepsilon$-lossy code whose information ratio is bounded above by $H(p_1, \ldots, p_n) + \varepsilon$. How can we prove such a thing? A general strategy in problem solving consists in reducing a complex problem to a subproblem that is completely constituted by easy instances. Think of the solution of linear systems: we reduce the non-triangular instances to the triangular instances using Gaussian elimination. Thus, let us ask: for which information sources is it easy to construct a code achieving Shannon's Limit?
Exercise 28 Suppose that $n$ is a power of two and suppose that $\{p_1, \ldots, p_n\}$ is the uniform distribution. Fix an injection $c : \Sigma_n \to \{0,1\}^{\log(n)}$. Prove that the information ratio of the prefix-free code determined by $c$ is equal to $H(p_1, \ldots, p_n)$.

Exercise 29 Suppose that $n$ is not a power of two, but suppose that the probability distribution is still uniform. Show that Shannon's Limit is asymptotically achieved using lossless codes.
We have that Shannon’s Limit can be easily achieved in the uniform case.
Can we reduce the general case to the uniform one?
Let $D = \{p_1, \ldots, p_n\}$ be the probability distribution of our information source and let $H(D)$ be its Shannon entropy. Given $N$, we use the symbol $D_N$ to denote the probability distribution that is induced by $D$ over the set $\Sigma_n^N$. This latter probability distribution is defined by the equality
$$\Pr[w] = p_1^{N_1} \cdots p_n^{N_n},$$
where $N_i$ denotes the number of occurrences of character $i$ in $w$. If, for all $i \leq n$, the ratio $\frac{N_i}{N}$ is close to $p_i$, then this latter quantity is close to $2^{-N H(D)}$.
The above observation motivates the following precise definition of typical strings.

Definition 30 Let $\delta > 0$; the set of $\delta$-typical strings of length $N$ is the set
$$T_{N,\delta} = \left\{ w \in \Sigma_n^N : \left| \frac{1}{N} \log \frac{1}{\Pr[w]} - H(D) \right| < \delta \right\}.$$
Remark 31 Notice that the typical strings are the strings whose information content is close to the expected value $N H(D)$. The law of large numbers tells us that $D_N$ will concentrate, as $N$ grows, around those strings.
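Remark 31 can be checked empirically. The short simulation below is our own sketch (the source and the value of $\delta$ are arbitrary choices): it draws i.i.d. strings of length $N$ and estimates the probability that $\frac{1}{N}\log\frac{1}{\Pr[w]}$ falls within $\delta$ of $H(D)$; this probability approaches 1 as $N$ grows.

import math
import random

def entropy(dist):
    return sum(p * math.log2(1 / p) for p in dist.values())

def empirical_typicality(dist, N, delta, trials=2000, seed=0):
    """Fraction of sampled strings w of length N whose per-symbol information
    content |(1/N) log2(1/Pr[w]) - H(D)| is smaller than delta."""
    rng = random.Random(seed)
    symbols = list(dist)
    weights = [dist[s] for s in symbols]
    h = entropy(dist)
    hits = 0
    for _ in range(trials):
        w = rng.choices(symbols, weights=weights, k=N)
        info = sum(math.log2(1 / dist[s]) for s in w) / N
        if abs(info - h) < delta:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    D = {"a": 0.5, "b": 0.3, "c": 0.2}
    for N in (10, 100, 1000):
        print(N, empirical_typicality(D, N, delta=0.1))  # fraction grows towards 1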
Exercise 34 Choose a suitable statement for The Source Coding Theorem and prove it. Let us suggest for $\delta$ the value $\frac{1}{\sqrt[5]{N}}$.
3.2 A Practical Coding Scheme: Lempel–Ziv

We know that we can achieve better information ratios if we code long blocks instead of characters. We did the latter in the previous sections: we studied the existence of block codes that achieve better ratios than the ratios achieved by coding schemes based on the coding of characters. The block codes considered in the previous sections force us to take into account the probability distribution of the source. If we are given a single message, and we have to compute the empirical frequencies in this string, we cannot compute the code in real time. We ask: which information ratios can be achieved using real-time block coding? Or, equivalently: what information ratios can be achieved using block coding but ignoring the probability distribution of the source? If we ignore the probability distribution of the source, we are necessarily using a coding strategy that works for any source, and which can be considered as universal. We claim that being universal implies being either lossless or inefficient. Let us study a lossless universal scheme invented by Abraham Lempel and Jakob Ziv [1], and which is the algorithmic core of most of the commercial software that is used for data compression.
The Lempel–Ziv scheme parses the input string into a sequence of phrases
$$v_0 \mid v_1 \mid \cdots \mid v_{c(n)}.$$
Consider, for instance, the string
$$w = AABABCBABAABACBBABBACC.$$
The parsing starts with the shortest string on the left we have not seen before. Notice that the 0-th phrase is always the empty string $\varepsilon$. We continue with the shortest phrase on the left we have not seen before. We get
$$\varepsilon \mid A \mid ABABCBABAABACBBABBACC.$$
We continue with the shortest phrase on the left we have not seen before. We get
$$\varepsilon \mid A \mid AB \mid ABCBABAABACBBABBACC.$$
Continuing in this way we get
$$\varepsilon \mid A \mid AB \mid ABC \mid B \mid ABA \mid ABAC \mid BB \mid ABB \mid AC \mid C.$$
Now, we have to describe this parsing using the alphabet $\{0,1\}$. We begin by observing that all the phrases that occur in this parsing are different, and that each one of those phrases is equal to a previous phrase plus a character on the right (all but the 0-th phrase). Thus, given $i \geq 1$, there exists a single $j < i$ such that $v_i$ is equal to $v_j x_{k(i)}$, where $x_{k(i)}$ is a letter in $\Sigma_n$. Observe that $v_i$ can be fully described with two data: the binary code of $j$ and the binary code of $x_{k(i)}$.
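A direct implementation of this parsing rule is short. The sketch below (ours) reproduces the phrase list of the example above and, for each phrase $v_i$ with $i \geq 1$, the pair $(j, x)$ such that $v_i = v_j x$.

def lz_parse(w):
    """Parse w into distinct phrases, each equal to a previously seen phrase
    plus one extra character (LZ78-style parsing).  Returns the list of phrases
    (phrase 0 is the empty string) and, for every later phrase, the pair
    (index of the previous phrase, appended character)."""
    phrases = [""]                 # the 0-th phrase is the empty string
    index = {"": 0}                # phrase -> its index
    pairs = []
    current = ""
    for ch in w:
        if current + ch in index:  # still a phrase we have already seen
            current += ch
        else:                      # shortest new phrase: previous phrase + ch
            pairs.append((index[current], ch))
            index[current + ch] = len(phrases)
            phrases.append(current + ch)
            current = ""
    if current:                    # a possibly repeated final phrase is left over
        pairs.append((index[current[:-1]], current[-1]))
        phrases.append(current)
    return phrases, pairs

if __name__ == "__main__":
    phrases, pairs = lz_parse("AABABCBABAABACBBABBACC")
    print(phrases)  # ['', 'A', 'AB', 'ABC', 'B', 'ABA', 'ABAC', 'BB', 'ABB', 'AC', 'C']
    print(pairs)    # e.g. (2, 'C') encodes the phrase 'ABC' = 'AB' + 'C'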
Remark 36 We can always fix a binary coding of $\Sigma_n$: it is enough to fix a prefix-free code that injectively assigns to each letter a binary string of length $\lceil \log(|\Sigma_n|) \rceil$. If $n = 3$ and $\Sigma_3 = \{A, B, C\}$ we can fix the prefix-free code
$$\hat{A} = 00, \quad \hat{B} = 01 \quad \text{and} \quad \hat{C} = 11.$$
Exercise 40 Set
$$Q(w) = \prod_{i=1}^{m} p_{w[i]}$$
and set
$$c_l = |\{i \leq c(m) : |y_i| = l\}|.$$
We have that the inequality
$$\log Q(w) \leq -c(m)\log c(m) + O\!\left(c(m)\log\frac{m}{c(m)}\right)$$
holds.
Proof. We have
$$\sum_l c_l \log c_l = c(m)\log c(m) + c(m)\sum_l \frac{c_l}{c(m)} \log \frac{c_l}{c(m)}.$$
The last term is $c(m)$ times $\sum_l \frac{c_l}{c(m)}\log\frac{c_l}{c(m)}$, and $-\sum_l \frac{c_l}{c(m)}\log\frac{c_l}{c(m)}$ is the entropy of the distribution
$$\pi(l) = \frac{c_l}{c(m)}.$$
The expected value of this distribution is $\sum_l l\,\frac{c_l}{c(m)}$, and it is equal to $\frac{m}{c(m)}$; since a distribution on the positive integers with mean $\mu$ has entropy $O(\log \mu)$, the entropy of $\pi$ is $O\!\left(\log\frac{m}{c(m)}\right)$. We get that
$$\log Q(w) \leq -c(m)\log c(m) + O\!\left(c(m)\log\frac{m}{c(m)}\right),$$
and the lemma is proved.
Now we can prove that Lempel–Ziv is asymptotically efficient: if $w$ is a typical string of length $m$, then the inequality
$$|\widehat{w}| \leq m\,H(p_1, \ldots, p_n) + o(m)$$
holds, where $\widehat{w}$ denotes the Lempel–Ziv codeword of $w$.

We have that
$$Q(w) = \prod_{i \leq n} p_i^{m_i},$$
where $m_i$ denotes the number of occurrences of character $i$ in $w$, and by the law of large numbers we have that $m_i$ is approximately equal to $m p_i$. Thus, taking logs, we get that
$$\log Q(w) \approx m \sum_{i \leq n} p_i \log p_i = -m\,H(p_1, \ldots, p_n),$$
and hence, by the previous lemma,
$$m\,H(p_1, \ldots, p_n) + O\!\left(c(m)\log\frac{m}{c(m)}\right) \geq c(m)\log c(m).$$
References

[1] J. Ziv, A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24 (5): 530–541, 1978.
4 Communication Over Noisy Channels and Shannon's Channel Coding Theorem

Alice's Gran and the young operator agreed on a coding scheme. Let us suppose that they chose to use Lempel–Ziv. They used this scheme to compute a short binary codeword for the message written by Alice's Gran using the alphabet $\Sigma_n$. Now they want to transmit this binary codeword using the telegraphic channel. So far we have supposed that the binary channel to be used is a noiseless channel. This latter assumption is unlikely.

Remark 45 Take into account that, after compression, the information content of Alice's message is distributed over $w$ in a complex way. The latter implies that the receiver won't have the possibility of recovering the corrupted bits only by observing the adjacent characters and doing some inferences (as is usually done with intelligible but corrupted messages written in natural language).
telegraphic channel, and communicate to him those two codes. In the real world those two algorithms are carefully chosen by the administrator of the communication network, and unconsciously used by most users of the network.

Example 48 Let $X$ be the random variable that models the choices of Alice's Gran, and let $Y$ be the random variable that models the choice of the binary strings that are seen by the receiver. The pair $(X, Y)$ is a pair of random variables that are jointly and not independently distributed.
Let us consider three examples of noisy channels. The first one is The Typewriter Channel, the second one is the Symmetric Binary Channel (SB-Channel, for short), and the third one is The Erasure Channel.

The typewriter channel is defined by:

1. The source and the channel alphabet are both equal to $\{0, 1, \ldots, 26\}$.

2. The probability distribution $d$ is defined by
$$P(x \mid y) = \begin{cases} \frac{1}{3}, & \text{if the distance between } x \text{ and } y \text{ is less than } 2, \text{ or } \{x, y\} = \{0, 26\},\\ 0, & \text{otherwise.} \end{cases}$$
The symmetric binary channel is defined by:

1. The source and the channel alphabet are both equal to $\{0, 1\}$.

2. The probability distribution $d$ is symmetric, that is, there is a parameter $f$, called the noise, such that
$$P(0 \mid 1) = P(1 \mid 0) = f \quad \text{and} \quad P(0 \mid 0) = P(1 \mid 1) = 1 - f.$$

The erasure channel erases each transmitted symbol $a$ (replacing it with the symbol $?$) with probability $e$:
$$\Pr[a \text{ becomes equal to } ?] = e, \qquad \Pr[a \text{ does not become equal to } ?] = 1 - e.$$
Exercise 54 Consider the typewriter channel. Prove that the probability distribution of the source that maximizes the information conveyed by this channel is the uniform distribution. Use the latter to compute the capacity of the channel.
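The capacity claimed in the exercise can be checked numerically. The sketch below (ours; the representation of the channel is an arbitrary choice) builds the transition matrix of the noisy typewriter on the 27-letter circular alphabet and computes $I(X;Y) = H(Y) - H(Y \mid X)$ for the uniform input distribution; the result is $\log_2 9 \approx 3.17$ bits per channel use.

import math

A = 27  # circular alphabet {0, 1, ..., 26}

def transition(x, y):
    """Noisy typewriter: each input is received as itself or one of its two
    circular neighbours, each with probability 1/3."""
    return 1 / 3 if min((x - y) % A, (y - x) % A) <= 1 else 0.0

def mutual_information_uniform():
    px = [1 / A] * A
    # Output distribution p(y) = sum_x p(x) p(y|x).
    py = [sum(px[x] * transition(x, y) for x in range(A)) for y in range(A)]
    h_y = sum(-p * math.log2(p) for p in py if p > 0)
    # Conditional entropy H(Y|X): each input spreads over 3 equally likely outputs.
    h_y_given_x = sum(px[x] * sum(-transition(x, y) * math.log2(transition(x, y))
                                  for y in range(A) if transition(x, y) > 0)
                      for x in range(A))
    return h_y - h_y_given_x

if __name__ == "__main__":
    print(mutual_information_uniform())  # log2(9) = 3.1699...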
The Channel Coding Theorem states that the capacity $C$ is the supremum of the information ratios that can be reliably achieved for the channel. Let us state this theorem precisely.
Definition 55 An $N$-block code for the channel $(\Sigma, \Gamma, d)$ is a set $S \subseteq \Sigma^N$. The information ratio of $S$ is equal to
$$\frac{\log(|S|)}{N \log(|\Sigma|)}.$$

The Channel Coding Theorem asserts that, for every $\varepsilon > 0$ and every sufficiently large $N$, there exists an $N$-block code $S$ for which the inequality
$$\frac{\log(|S|)}{N \log(|\Sigma|)} > C - \varepsilon$$
holds, while the probability of block error is kept below $\varepsilon$. Moreover, if the information ratio $R$ is asymptotically achieved for the channel $(\Sigma, \Gamma, d)$ with a negligible probability of block error, then the inequality $R \leq C$ must hold.
We can easily confirm the theorem in the case of the typewriter channel. We can create a completely error-free communication strategy using a block code of length $N = 1$: we use only the letters B, E, H, \ldots, Z, i.e., every third letter. These letters form a non-confusable subset of the input alphabet. Thus, any output can be uniquely decoded. The number of inputs in the non-confusable subset is 9, so the error-free information rate of this system is
$$\frac{\log(9)}{\log(27)} = \frac{2}{3}.$$
4.3 Sketch of the Proof of Shannon's Theorem

Let $(\Sigma, \Gamma, d)$ be an arbitrary channel. We prove something weaker: we prove that the information ratio of any lossless code is bounded above by the capacity $I(X; Y)$. To prove the latter we consider the extended channel that corresponds to $N$ uses of the original channel. This extended channel has $|\Sigma|^N$ possible inputs and $|\Gamma|^N$ possible outputs. We make use of large block lengths. If $N$ is large, the extended channel looks a lot like the noisy typewriter: any particular input $w$ is very likely to produce an output within a small subspace of the output alphabet. So, we can try to find a non-confusable subset of the inputs that produce essentially disjoint output sequences.

Let $N$ be a large positive integer. The number of typical output sequences is $2^{N H(Y)}$, all having similar probability. Given a typical input sequence $w$, there are about $2^{N H(Y \mid X)}$ probable output sequences. Suppose that the channel is not very noisy. Then, the latter quantity becomes strictly smaller than the former, and the gap between those two quantities grows as $N$ grows. Then, we can imagine taking a large enough $N$ and restricting ourselves to a subset of the typical inputs such that the corresponding typical output sets do not overlap. The maximum number of non-overlapping subsets of size $2^{N H(Y \mid X)}$, included in a set of size $2^{N H(Y)}$, is bounded above by the ratio
$$\frac{2^{N H(Y)}}{2^{N H(Y \mid X)}} = 2^{N (H(Y) - H(Y \mid X))} = 2^{N I(X; Y)}.$$
5 Good Codes For The Symmetric Binary Channel

A linear block code of length $N$ and dimension $k$ is a linear embedding $c : \mathbb{F}^k \to \mathbb{F}^N$. The code $c$ can be suitably represented by a matrix $G_c$ that we call the generator matrix of the code, and which is nothing more than the matrix representation of the linear map $c$: the matrix $G_c$ is equal to
$$\begin{bmatrix} c(e_1) & \cdots & c(e_k) \end{bmatrix},$$
where $\{e_1, \ldots, e_k\}$ is the canonical basis of $\mathbb{F}^k$. Notice that the codewords are the linear combinations of the columns of $G_c$, and we can identify the code either with the linear map $c$ or with the range of $c$. By an abuse of notation we use the symbol $c$ to denote the latter vector subspace of $\mathbb{F}^N$. Notice that the vector space $c$ can also be described by a matrix whose rows (columns) constitute a basis of its orthogonal complement.
Remark 58 We define the orthogonal complement of $c$, the set $c^{\perp}$, as the set
$$\{v \in \mathbb{F}^N : \text{for all } w \in \mathrm{Range}(c) \text{ the equality } \langle v, w \rangle = 0 \text{ holds}\},$$
where $\langle v, w \rangle$ is defined as
$$v[1]\,w[1] + \cdots + v[N]\,w[N],$$
with the arithmetical operations computed in the ground field $\mathbb{F}_2$. We have that the set $c^{\perp}$ is a vector subspace of dimension $N - k$: the dimension of $c^{\perp}$ complements the dimension of $c$. However, it is no longer true that $c$ and $c^{\perp}$ have a trivial intersection: suppose that $N$ is an even integer and let $v \in \mathbb{F}_2^N$ be the all-ones vector; notice that the equality $\langle v, v \rangle = 0$ holds, and notice that it implies the equality $\langle v \rangle \cap \langle v \rangle^{\perp} = \langle v \rangle$. Thus, we should use the term orthogonal pseudo-complement of $c$ to refer to the set $c^{\perp}$. We will stick, by an abuse of language, with the term orthogonal complement to designate the set $c^{\perp}$.
Exercise 59 Prove that the equality
$$\dim\left(\mathrm{Range}(c)^{\perp}\right) = N - \dim(\mathrm{Range}(c))$$
actually holds.
Let $w_1, \ldots, w_{N-k}$ be a basis of $c^{\perp}$, and let
$$PC_c = \begin{bmatrix} w_1 \\ \vdots \\ w_{N-k} \end{bmatrix}$$
be the $(N-k) \times N$ matrix whose rows are those basis vectors. We have that $w \in \mathrm{Range}(c)$, if and only if, the equality
$$PC_c\, w = 0$$
holds. We say that $PC_c$ is a parity check matrix for $c$. We have that a linear block code $c$ can be completely described by a parity check matrix.
Suppose we are given the matrices $G_c$ and $PC_c$; it is quite straightforward to:

1. Encode a message $w \in \mathbb{F}^k$: the code of $w$ is equal to $G_c\, w$.

2. Recognize the codewords: $u$ is a codeword, if and only if, $PC_c\, u = 0$ (see the sketch below).
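As a sketch of these two operations (ours; we reuse the $[7,4]$ parity-check code from the first lecture, writing down its generator and parity-check matrices explicitly), encoding is a matrix–vector product over $\mathbb{F}_2$ and recognition is the computation of a syndrome.

# Generator matrix G_c (7 x 4): column i is the codeword of the i-th basis vector.
G = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],   # parity bit x1 + x2 + x3
    [1, 1, 0, 1],   # parity bit x1 + x2 + x4
    [1, 0, 1, 1],   # parity bit x1 + x3 + x4
]

# Parity-check matrix PC_c ((N - k) x N = 3 x 7): its rows span the dual code.
PC = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def matvec_gf2(M, v):
    """Matrix-vector product over the field F_2."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def encode(w):
    """Encoding: the codeword of w is G_c w."""
    return matvec_gf2(G, w)

def is_codeword(u):
    """Recognition: u is a codeword iff its syndrome PC_c u is the zero vector."""
    return all(s == 0 for s in matvec_gf2(PC, u))

if __name__ == "__main__":
    u = encode([1, 0, 1, 1])
    print(u, is_codeword(u))        # a valid codeword
    u[0] ^= 1
    print(u, is_codeword(u))        # one flipped bit violates some parity check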
Decoding is not so easy. It is known that the corresponding decoding problem is NP-hard, and it implies that optimal decoding of binary linear codes is not feasible.

Remark 61 If we fix a linear block code $c$ (a matrix $H$) we get a problem with a finite number of instances. Such a problem cannot be NP-hard; such a problem can be solved in time $O(1)$. Then, what is the problem?
First of all, we have to say that we cannot simply fix a code like The Hamming Code; we are obliged to aspire to greater and greater code lengths. Suppose we are stubborn and we insist on a specific code length. This code length has to be very large in order to be useful. Then, we get a finite problem that is constituted by a finite but large number of hard instances of an NP-hard problem. We get almost the same conclusion: the maximum likelihood decoding of binary codes is not practical for any useful size.
There are sub-optimal techniques, based on message passing (see below), which give excellent results and can be practically implemented. We will study those decoding techniques in a future lecture. Let us focus for a while on the following question: are there good linear codes for the symmetric binary channel? If we want to cope with the latter question, we will have to give an answer to the following question: what is a good code?

Let us try an answer to the above question. A good code for a symmetric binary channel $SB(f)$ is a code $c$ such that:

1. It is easy to encode messages using this code.

2. The information ratio of the code approaches Shannon's limit.

3. The code has good error correction capabilities (for the noise parameter $f$).

4. Practical decoding is possible.

We know that the first property holds for any linear code. It remains to investigate whether there exist linear codes for which the three additional properties also hold.
5.1 Reliability and Minimum Distance

Let $f < \frac{1}{2}$. Shannon's Channel Coding Theorem tells us that we can reliably communicate over the channel $SBC(f)$ at any transmission rate that is below the capacity of the channel, which is equal to $1 - H_2(f)$ (here the symbol $H_2(f)$ denotes the binary entropy of $f$). Then, if we want to prove that there exist good codes for the symmetric binary channel, we will have to prove that, given $f < \frac{1}{2}$ and given $\varepsilon > 0$, there exists a code whose information ratio is greater than $1 - H_2(f) - \varepsilon$ and which can be used for reliable communication over the channel $SBC(f)$. Let us ask: how can we ensure that a code is reliable?

Definition 62 Let $c \subseteq \mathbb{F}_2^N$ be a code; the minimum distance of $c$ is equal to $\min\{d_H(v, w) : v, w \in c \text{ and } v \neq w\}$.
Suppose that Alice uses $SBC(f)$ to transmit $N$ bits. Let $K$ be the number of bits incorrectly transmitted. The expected number of those bits is equal to $f N$. However, if $N$ is small, the ratio $\frac{K}{f N}$ could be far away from 1 with high probability. The law of large numbers ensures that the latter does not occur when we make $N$ go to infinity. Thus, suppose that $N$ is large, let $w \in \{0,1\}^N$ be the transmitted message and let $u$ be the received message. The Hamming distance $d_H(w, u)$ is equal to the number of errors that occurred along the transmission. Let $\varepsilon > 0$. We have that $d_H(w, u)$ is less than $(f + \varepsilon) N$ with high probability, and as $N$ goes to infinity this probability goes to 1. Now suppose that we are using a block code $c$ of length $N$, and suppose that the minimum distance of $c$ is greater than $(2f + \varepsilon) N$ for some $\varepsilon > 0$. We get that:

Let $c$ be a code of length $N$ and minimum distance greater than $2 f N$. If $N$ is large, the code can be used for reliable communication over the channel $SBC(f)$.
Notice that our quest for good codes reduces to showing that, given $f < \frac{1}{2}$, given a positive integer $N$ and given $\varepsilon > 0$, there exists a code of length $N$ whose minimum distance is greater than $2 f N$ and whose information ratio is greater than $1 - H_2(f) - \varepsilon$.

We show, in the last lecture of this booklet, that the latter can be done for small values of $f$. We show that a random construction works: if we choose the code at random, from a suitable ensemble, we get a good code with high probability.
LDPC (low-density parity-check) codes have two remarkable properties:

1. LDPC codes are capacity-approaching codes, which means that there exist practical constructions of LDPC codes that allow the noise threshold to be set very close to the theoretical maximum (the Shannon limit).

2. LDPC codes can be probabilistically decoded in linear time using message passing.

We will study LDPC codes for a while. We focus, in the remainder of the chapter, on the first of the above items.
LDPC codes are also known as Gallager codes, in honor of Robert G. Gallager, who developed the LDPC concept in his doctoral dissertation at the Massachusetts Institute of Technology in 1960 [1]. LDPC codes were forgotten until Gallager's work was rediscovered in 1996 by MacKay and Neal [2]. LDPC codes became, after the work of MacKay, the coding scheme of choice.
Let $c \subseteq \mathbb{F}_2^N$ be a linear code, and let $PC_c$ be a parity check matrix for this code. We can think of $PC_c$ as the adjacency matrix of a bipartite graph constituted by two sets of nodes: a set of bit nodes representing the transmitted bits, and a set of check nodes representing the constraints that the transmitted bits have to satisfy. Let $TG_c$ be this bipartite graph, where the adjacency relation is defined by: the $j$-th bit node is adjacent to the $i$-th check node if and only if $PC_c[i, j] = 1$.
If $PC_c$ is a sparse matrix, with few ones per column and few ones per row, then the graph $TG_c$ is a sparse graph. We say in the latter case that $c$ is a sparse graph code. Coding theory is based, nowadays, on sparse graph codes achieving performance close to the Shannon limit. The archetypal sparse graph codes are Gallager's low-density parity-check codes.
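To make the correspondence concrete, here is a small sketch (ours; the particular matrix is an arbitrary toy example, not a code from the text) that takes a sparse parity-check matrix and lists the resulting bipartite graph as an adjacency structure between check nodes and bit nodes.

def tanner_graph(pc_matrix):
    """Return, for each check node, the list of adjacent bit nodes:
    check i is adjacent to bit j exactly when PC[i][j] = 1."""
    return {f"check_{i}": [f"bit_{j}" for j, entry in enumerate(row) if entry == 1]
            for i, row in enumerate(pc_matrix)}

if __name__ == "__main__":
    # A toy sparse parity-check matrix: three ones per row, two ones per column.
    PC = [
        [1, 1, 0, 1, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 0, 1],
    ]
    for check, bits in tanner_graph(PC).items():
        print(check, "--", bits)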
Remark 66 The graph $TG_c$ is called the factor graph of $c$. Factor graphs constitute one of the many types of Tanner graphs that are related to codes. Tanner graphs are named after Michael Tanner, who introduced them as a means to create larger error correcting codes from smaller ones [3].
Proof. Suppose that $S$ has $u$ unique neighbors, and let $D = t\,|S|$ be the number of edges emanating from $S$. We have
$$t\,|S| \geq u + (|\Gamma(S)| - u)\cdot 2,$$
which gives $u \geq 2\,|\Gamma(S)| - D > \frac{D}{2}$.
Lemma 69 The minimum distance of $c_G$ is at least $\alpha N$.

Proof. We show that each codeword has at least $\alpha N$ ones. Suppose that this is not the case. Let $w \in c_G$ be a string with fewer than $\alpha N$ ones. Let
$$I_w = \{i \leq N : w[i] = 1\}.$$
We have that $I_w$ has more than $\frac{D}{2}$ unique neighbors, where $D = t\,|I_w|$. We get that at least $\frac{D}{2}$ parity checks are not satisfied.
$$\frac{N - R}{N} = 1 - 8e^2\,\alpha\,\log\frac{1}{\alpha}.$$
If $\alpha$ is small, the term on the right approaches the quantity $1 - H_2(f)$. We get that the existence of arbitrarily large $(t, \alpha, \delta)$-expanders ensures the existence of good codes for $SB(f)$ (when $f$ is small). It remains to prove that given those parameter values there exist $(t, \alpha, \delta)$-expanders that are as large as desired.
Exercise 71 Suppose that $f \leq \frac{1}{16}$. Set $\alpha = 2f$, $t = \log\frac{1}{\alpha}$ and $R = 8e^2\,t\,\alpha\,N$. Pick a random $t$-regular bipartite graph with $N$ bit nodes and $R$ parity check nodes by connecting each bit node to $t$ random neighbors. Prove that the probability that the constructed graph is a $(t, \alpha, \delta)$-expander is bounded below by $\frac{1}{2}$.
References
[1] R. G. Gallager. Low Density Parity Check Codes. Number 21 in MIT Research Monograph Series. MIT Press, 1963. Available from www.inference.phy.cam.ac.uk/mackay/gallager/papers/.
[2] D. MacKay, R. Neal. Near Shannon Limit Performance of Low Density Par-
ity Check Codes. Electronics Letters, July 1996.
[3] M. Tanner. A recursive approach to low complexity codes. IEEE Transactions on Information Theory 27(5): 533–547, 1981.
6 Constrained Noiseless Channels And Trellises

In this lecture we study a completely different type of channel, the so-called Constrained Noiseless Channels. Those channels will lead us to the important notion of trellis codes, which will be instrumental in the decoding of good codes for the symmetric binary channel.

We can begin with a constant length code that maps each source bit into two transmitted bits. We can, for instance, map 0 to 00 and 1 to 10. This is a code of ratio $\frac{1}{2}$, and it respects the constraints of the channel. We get that the capacity of the channel is at least 0.5. Can we do better?
The above code is redundant. We know that the second bit will always be a zero. We can achieve a smaller average transmitted length using a code that omits the redundant zeroes. We can try a variable length code that maps 0 to 0 and 1 to 10. If the source symbols are used with equal frequency then the average transmitted length per source bit is $\frac{3}{2}$. Can we do better?
A second, more fundamental approach counts how many valid sequences of
length N there are. Let SN be the latter quantity. We can communicate log (SN )
bits in N channel cycles by giving one name to each of these valid sequences.
This approach allows us to achieve the capacity of the channel. Then, if we want
to estimate this capacity we should consider the following counting problem:
Problem 73 #CNS
The above problem leads us, in turn, to the problem of counting paths over trellises.
The set of nodes of $T(M, q_0, n)$ is the set $Q_M \times \{0, \ldots, n\}$.

The pair $((p, i), (q, j))$ is a directed edge of $T(M, q_0, n)$ labeled with the letter $a$, if and only if, $j = i + 1$ and $M(p, a) = q$.

Let $L(\tau)$ be the set of sequences that are valid for $\tau$. We have that $L(\tau)$ is equal to $L(M)$, and we have that this latter set is in bijective correspondence with the set of $n$-length paths included in $T(M, q_0, n)$ that start at $(q_0, 0)$. Thus, we naturally arrive at the following counting problem:
1. If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.

2. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.

3. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.
The commander can compute the global number of soldiers by simply adding
together the number said to him by the
soldier in front, and the number said to him by the soldier behind. This
clever trick makes use of a profound property of the total number of soldiers:
this quantity can be written as the sum of the number of soldiers in front of a
point, and the number behind that point, two quantities which can be computed
separately, because the two groups are separated by the point.
If the soldiers were not arranged in a line but were travelling in a swarm,
then it would not be easy to separate them into two groups in this way. If the
connection graph of this swarm contains cycles, it is not possible for an agent
in a cycle to separate the group into two groups, ‘those in front’, and ‘those
behind’. A swarm of agents can be counted by a modified message-passing algorithm provided that the agents are arranged in a graph that contains no cycles.
6.4.2 Viterbi (Min-Sum) Algorithm

The algorithm that we study in this section is Viterbi's Min-Sum Algorithm [2]. Suppose we are given a weighted digraph $(G, w)$, and suppose that the weight assigned to each edge corresponds to the cost of crossing it. Suppose also that we are interested in computing cheapest paths in $G$, that is, suppose that we are interested in efficiently solving the following problem.

and transmit this message along all the edges that go out from $v$.

Exercise 85 Prove that the above tropical algorithm allows us to compute minimum distances in weighted digraphs.
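A minimal version of the min-sum recursion is easy to write down for a trellis-shaped digraph: process the time slices from left to right and keep, for every node, the cost of the cheapest path reaching it. The sketch below (ours; the edge-list format is an arbitrary choice) computes those optimal costs.

def viterbi_min_sum(num_stages, edges, start):
    """Cheapest-path costs in a trellis.

    `edges[i]` is a list of triples (p, q, cost): an edge from node p in slice i
    to node q in slice i + 1.  Returns, for every slice, a dictionary mapping
    each reachable node to the cost of the cheapest path from `start`."""
    best = [{start: 0.0}]                      # slice 0: only the start node
    for i in range(num_stages):
        current = {}
        for p, q, cost in edges[i]:
            if p in best[i]:                   # p is reachable in slice i
                candidate = best[i][p] + cost  # extend the cheapest path to p
                if q not in current or candidate < current[q]:
                    current[q] = candidate
        best.append(current)
    return best

if __name__ == "__main__":
    # A toy 3-stage trellis with two states per slice.
    edges = [
        [("s", "a", 1.0), ("s", "b", 2.0)],
        [("a", "a", 2.0), ("a", "b", 0.5), ("b", "a", 1.0), ("b", "b", 3.0)],
        [("a", "t", 1.0), ("b", "t", 4.0)],
    ]
    print(viterbi_min_sum(3, edges, "s"))
    # the final slice gives the cost of the cheapest s -> t path (here 4.0)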
Show that the integral image $I(x, y)$ can be efficiently computed by message passing. Show that, from the integral image, some simple functions of the image can be obtained. For example, give an expression for the sum of the image intensities $f(x, y)$ for all $(x, y)$ in a rectangular region extending from $(x_1, y_1)$ to $(x_2, y_2)$.
References
[1] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Proceedings of the Second National Conference on Artificial Intelligence (AAAI), pages 133–136, 1982.
[2] A. Viterbi. Error bounds for convolutional codes and an asymptotically op-
timum decoding algorithm. IEEE Transactions on Information Theory 13 (2):
260–269, 1967.
7 Decoding of Trellis Codes

Suppose that Alice and Bob are communicating over a noisy channel $SB(f)$, and suppose that they agreed on a binary $N$-block code $c$. Let $S \subseteq \{0,1\}^N$ be the set of codewords. Suppose Alice transmits a string $w \in S$, and let $\widehat{w}$ be the received string. Bob has to solve the following task: to compute the codeword $u$ that maximizes the quantity

$\Pr[ec(u \mid \widehat{w})]$ = the probability that $u$ was the transmitted string, given that $\widehat{w}$ is the string that we (Bob) received;

$\Pr[rc(\widehat{w} \mid u)]$ = the probability that $\widehat{w}$ is the message that Bob will receive, given that $u$ is the string that we (Alice) chose to transmit.

If the code $c$ is a good code, then the set of codewords is a large set, and it becomes unfeasible to compute the latter sum. Fortunately, we are not forced to compute the above sum, and the corresponding probability, given that we are assuming that $\widehat{w}$ is the input of our problem, and we are looking for the $u$ that fits $\widehat{w}$ the best. Thus, we focus on the numerator $\Pr(rc(\widehat{w} \mid u))\,\Pr[u]$. If we assume that Alice chose the message uniformly at random from $S$, we get that for all $u \in S$ the equality $\Pr[u] = \frac{1}{|S|}$ holds. Then, under the latter assumption, we can dispense with the factor $\Pr[u]$, and our problem reduces to the following one:
Notice that we can solve the above problem using a brute force algorithm: we compute $\Pr(rc(\widehat{w} \mid u))$ for all $u$ in $S$, and then we choose the $u$ that maximizes the latter quantity. However, this approach is unfeasible for most codes. We ask: can we do better than brute force? The bad news is that it is not always possible to do better, given that the decoding problem is NP-hard. The good news is that it is possible to do a good job for a large family of codes.
Suppose we are given as input the trellis $T$ and the string $\widehat{w}$. We can assign to the edge $((p, i), (q, i+1))$ the cost
$$-\log\left(\Pr[r(\widehat{w}[i] \mid t)]\right),$$
where $t$ is the label of this edge and $\Pr[r(\widehat{w}[i] \mid t)]$ denotes the probability of receiving bit $\widehat{w}[i]$ after transmitting bit $t$.

Exercise 92 Why did we assign weight $-\log(\Pr[r(\widehat{w}[i] \mid t)])$ to the edge $((p, i), (q, i+1))$ instead of the simpler weight $\Pr[r(\widehat{w}[i] \mid t)]$?

After assigning the above weights to the edges in $T$ we get a weighted trellis that we denote with the symbol $T_{\widehat{w}}$.
Exercise 93 Show that the output of Viterbi's algorithm, on input $T_{\widehat{w}}$, corresponds to the weight of the path in $T_{\widehat{w}}$ that represents the solution of The MAP Codeword Decoding Problem for the input $(c_T, \widehat{w})$.

Exercise 94 Viterbi's algorithm, as defined in the previous section, allows us to compute optimal weights. Now we want to compute optimal weights as well as the paths achieving those weights. Your first task is to design a variant of Viterbi's algorithm for the computation of optimal paths in weighted trellises. Your second task is to use the previous algorithm in the design of a decoding algorithm for trellis codes.
8 Trellis Representations of Linear Codes

Gallager proved that good LDPC codes can be constructed at random. This work was neglected for a long time because he did not provide decoding algorithms for those codes. The existence of efficient decoding algorithms for those codes is related to the following fact:

If we can efficiently construct trellis representations of LDPC codes, then we can use those representations and Viterbi's algorithm for efficient decoding.

In this lecture we study the existence and construction of trellis representations of linear block codes. A trellis is a digraph
$$(V_0 \sqcup V_1 \sqcup \cdots \sqcup V_N,\; E)$$
that satisfies the following (sequential) condition: for all $(v, w) \in E$ there exists $i < N$ such that $v \in V_i$ and $w \in V_{i+1}$. We say that $N$ is the length of the trellis, and we think of the interval $\{0, 1, \ldots, N\}$ as the time axis of the trellis.

Given a set $S$ of strings of length $N$, one can build a naive trellis $(G_S, L)$ whose $i$-th slice is the set
$$\{(w, i) : w \in S\}.$$
The trellis $(G_S, L)$ appears when one naively represents each string of $S$ as a path. Naive representations tend to be useless, and this is the case with the trellis $(G_S, L)$. The problem with this trellis is that it corresponds to a dilated representation of the code: we have that $|G_S| = N\,|S|$. Then, if $S$ is large, the trellis $(G_S, L)$ is even larger. The trellis representations that can be of use are the concise representations. What does concise mean?

On one hand we have that any linear code of length $N$ can be represented by a matrix of size $O(N^2)$ (linear codes have algebraic representations of polynomial (quadratic) size). And, on the other hand, we have that representations of superpolynomial size yield algorithms that run in superpolynomial time. We get a conclusion: a concise representation of a block code of length $N$ is a representation whose size is polynomial in $N$. We ask: let $c : \mathbb{F}_2^k \to \mathbb{F}_2^N$ be a linear code, can we efficiently construct a trellis representation of $c$ of size $O(N^d)$, for some fixed $d$ that does not depend on the code?
1. $c_P(0) = c_S(0) = \{0\}$ and $c_F(0) = c$.

2. $c_P(N) = \{v \in c : v[N] = 0\}$, $c_S(N) = \mathbb{F}_2$ and $c_F(N) = \{0\}$.

3. Each codeword $w$ determines a sequence

$$w_S^{i+1} = u_S^{i+1}$$

must hold.
We get, from the above exercise, that we can construct a trellis representation of $c$ using the vector spaces $c_S(i)$ as fibers.

Exercise 100 Prove that there exists a bijective correspondence between the set of codewords and the set of traverses of the above trellis.

We use the symbol $T(c)$ to denote the above trellis. Notice that $T(c)$ is completely determined by the linear structure of the code $c$. Is this representation small? Notice that the $i$-th fiber has size $2^{\dim(c_S(i))}$, which suggests that $T(c)$ could be a large trellis, with fibers of superpolynomial size. We have:

Theorem 101 Let $M$ be a trellis representation of $c$, and let $M(i)$ be its $i$-th fiber; the inequality $|M(i)| \geq 2^{\dim(c_S(i))}$ holds.

The latter assertion is called The State Space Theorem (see [1]), and it implies that $T(c)$ is a minimal trellis representation of the linear block code $c$. We conclude that if the code $c$ has a small trellis representation, then $T(c)$ is small.
8.3 Construction of Minimal Trellises

We know, from the previous section, that $T(c)$ is the minimal trellis representation of $c$. Can we construct this representation efficiently?

Let $i \leq N$. One can use basic linear algebra to compute the decomposition $c_P(i) \oplus c_F(i) \oplus c_S(i)$ (to compute a basis of each one of the vector spaces $c_P(i)$, $c_F(i)$ and $c_S(i)$). The latter means that one can efficiently compute the fibers, even if those fibers are large. Then, the only thing that remains to be done is to compute the transitions. We can solve the latter task using a brute force approach: list the codewords, and represent each codeword as a labeled path over the trellis $T(c)$. We know that brute force solutions tend to be unfeasible, and this is the case with the above solution: if $c$ is a good code, its dimension $k$ is large, and the set of codewords is a large set of size $2^k$. Thus, we have to try a different approach. Observe that computing the transitions reduces to deciding, for each possible triple $((x, i), (y, i+1), a)$, whether this triple is a transition or not. If the fibers are small, the number of those triples is also small, and the work can be done triple by triple. The latter means that we can focus on the following problem:
Let $c$ be a linear block code of length $N$, let $i \leq N$, let $x \in c_S(i)$, let $y \in c_S(i+1)$ and let $a \in \{0,1\}$. Decide if there exists $w \in c$ such that $w[i+1] = a$ and such that $w_S^i = x$ and $w_S^{i+1} = y$.

Let $(x, y, i, a)$ be an instance of the above problem; deciding whether the latter instance is a positive instance corresponds to deciding whether there exists $w \in c$ such that:

1. $\exists u \in c_P(i)\ \exists v \in c_F(i)\ (u + v + x = w)$.

2. $\exists r \in c_P(i+1)\ \exists s \in c_F(i+1)\ (r + s + y = w)$.

3. $w[i+1] = a$.

We observe that the latter question can be efficiently decided using elementary tools of linear algebra.

We conclude that if $\dim(c_S(i))$ is small for all $i$, then the trellis $T(c)$ is a small trellis that can be efficiently computed from the algebraic representation of the code $c$.
8.4 Altogether

If we put all the pieces together, we get a communication protocol for binary symmetric channels:

1. Alice and Bob agree on a Gallager code $c$ that is good for the channel at hand: the rate is close to the capacity of the channel, and the error probability (or the minimum distance) is small enough for Alice and Bob.

2. Alice uses a generator matrix of the code $c$ for encoding.

3. Bob computes the trellis $T(c)$, and uses this trellis together with Viterbi's Algorithm for decoding.

This very schematic communication protocol is, to some extent, the communication protocol of common use. The reader should notice, at this point, that we can solve, algorithmically, each one of the above three tasks.
References
[1] A. Vardy. Trellis structure of codes. In V. S. Pless and W. C. Huffman, Eds., Handbook of Coding Theory. Amsterdam, Elsevier Science, pp. 1989–2118, 1998.
9 Good Codes for the Erasure Channel
In this lecture, the last lecture of this booklet, we study the existence of good
codes for the erasure channel.
$$p_w(X) = a_1 + a_2 X + \cdots + a_k X^{k-1}.$$
Let $c_1, \ldots, c_n$ be $n$ different elements of a large enough finite field $\mathbb{F}$, and let $p_w(X)$ be the above polynomial. We can evaluate $p_w(X)$ at the points $c_1, \ldots, c_n$. If we do the latter we get a new codeword, the vector $(p_w(c_1), \ldots, p_w(c_n))$. If $n > k$, the latter vector is resilient: this vector can be fully recovered from any subvector of dimension $k$. The latter means that the information content of each entry is distributed over all the $k$-dimensional subvectors, and this makes it hard to lose this information. This fact allows us to think of the following coding strategy:

Let $w = (w_1, \ldots, w_k)$ be the message to be stored, choose $k + d$ different elements of $\mathbb{F}$, say the elements $c_1, \ldots, c_{k+d}$, and compute the vector
$$T^{k+d}(w) = (p_w(c_1), \ldots, p_w(c_{k+d})).$$

If one stores the vector $T^{k+d}(w)$, and there occur at most $d$ deletions, he will be able to reconstruct the polynomial $p_w(X)$ and the original message $w$ using the non-deleted entries.
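The recovery step is plain polynomial interpolation over a finite field. The following sketch (ours; it works over the prime field $\mathbb{F}_p$ with $p = 257$, an arbitrary choice, and uses Lagrange interpolation rather than anything optimized) evaluates $p_w$ at $k + d$ points, erases some of the values, and reconstructs the message from the survivors.

P = 257  # a prime; arithmetic is done in the field F_P

def poly_eval(coeffs, x):
    """Evaluate a1 + a2*X + ... + ak*X^(k-1) at x, modulo P (Horner's rule)."""
    result = 0
    for a in reversed(coeffs):
        result = (result * x + a) % P
    return result

def interpolate(points, k):
    """Lagrange interpolation: recover the k coefficients of a polynomial of
    degree < k from k evaluation points (x_j, y_j) over F_P."""
    coeffs = [0] * k
    for j, (xj, yj) in enumerate(points):
        # Build the j-th Lagrange basis polynomial as a coefficient vector.
        basis = [1]
        denom = 1
        for m, (xm, _) in enumerate(points):
            if m == j:
                continue
            # Multiply the basis polynomial by (X - xm).
            basis = [(-xm * basis[0]) % P] + [
                (basis[i - 1] - xm * basis[i]) % P for i in range(1, len(basis))
            ] + [basis[-1]]
            denom = (denom * (xj - xm)) % P
        scale = (yj * pow(denom, P - 2, P)) % P   # divide by denom via Fermat
        for i in range(k):
            coeffs[i] = (coeffs[i] + scale * basis[i]) % P
    return coeffs

if __name__ == "__main__":
    message = [7, 3, 0, 11]                          # k = 4 field elements
    k, d = len(message), 3
    xs = list(range(1, k + d + 1))                   # k + d distinct points
    stored = [(x, poly_eval(message, x)) for x in xs]
    survivors = stored[:2] + stored[5:]              # 3 entries were erased
    print(interpolate(survivors[:k], k) == message)  # True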
A code like $T^{k+d}$ is a Reed–Solomon Code [1]. Those codes are the algebraic codes that are constructed by representing messages as polynomials, and by evaluating those polynomials at large subsets of the ground field.

Singleton's inequality implies that codes satisfying the equality $k + d = m$ are optimal (maximum distance separable codes). We get that Reed–Solomon codes are optimal. However, in spite of their optimality, those codes are unpractical:

Exercise 109 Prove that the coding complexity and the decoding complexity of Reed–Solomon codes is $O(n \log^2(n))$ (Hint: you have to use the Fast Fourier Transform).

It can be proved that the coding complexity and the decoding complexity of Reed–Solomon codes is $\Theta(n \log^2(n))$. If $n$ is large, the factor $\log^2(n)$ is a serious drawback: take into account that there exist optimal erasure codes whose coding and decoding complexities are linear in $n$ (see below). Moreover,
with a Reed-Solomon code, as with any block code, one must estimate the
erasure probability e and choose the code rate before transmission. If we are
unlucky and e is larger than expected, and the receiver receives fewer than k
symbols, the message is completely lost.
Remark 110 Tornado codes are a special type of fixed-rate erasure codes which require more redundancy than Reed–Solomon codes, but which are much faster to generate. Software-based implementations of Tornado codes are about 100 times faster on small lengths and about 10,000 times faster on larger lengths than Reed–Solomon erasure codes [2]. Since the introduction of Tornado codes, many other similar erasure codes have emerged, most notably Online codes (see [2] and the references therein).
We can use the code $c_G$ as an erasure code, and we can use the graph $G$ for iterative decoding using a variation of The Sum–Product Algorithm. Suppose that a codeword $w_1 \cdots w_N \in c_G$ is transmitted, and let $u_1 \cdots u_N$ be the received string. We set
$$I = \{i \leq N : u_i = w_i \neq\ ?\}.$$
Let $b_i$ be a node in the set $\{b_j : j \in I\}$. Notice that the value assigned to the latter node is known after transmission. Decoding (which corresponds to computing the codeword that was transmitted) reduces to propagating the values of those nodes through the code graph. The sum-product update rule reduces simply to:

Let $b$ be a bit node. Suppose that $b$ is adjacent to a parity check node $p$, and suppose that for all $c \neq b$ such that $c$ is adjacent to $p$ we know the value of $c$. Then the value of $b$ is set equal to the mod-2 sum of all those $c$'s.

Since all unerased variables are correct, there is no chance that these rules could produce a variable assignment that conflicts with another assignment.
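The update rule above translates directly into a peeling decoder: repeatedly look for a parity check in which exactly one participating bit is still erased, and fill that bit in with the mod-2 sum of the others. A sketch follows (ours; the parity-check matrix is that of the $[7,4]$ code used earlier in these notes, an assumption made only to have a concrete example).

def erasure_decode(pc_matrix, received):
    """Iterative erasure decoding on a binary parity-check code.

    `received` is a list of 0, 1 or None (an erased position).  Whenever a
    parity check has exactly one erased participant, that bit is set to the
    mod-2 sum of the other participants.  Returns the completed word, or the
    partially decoded word if the process gets stuck."""
    word = list(received)
    progress = True
    while progress:
        progress = False
        for row in pc_matrix:
            positions = [j for j, entry in enumerate(row) if entry == 1]
            erased = [j for j in positions if word[j] is None]
            if len(erased) == 1:
                word[erased[0]] = sum(word[j] for j in positions if j != erased[0]) % 2
                progress = True
    return word

if __name__ == "__main__":
    PC = [
        [1, 1, 1, 0, 1, 0, 0],
        [1, 1, 0, 1, 0, 1, 0],
        [1, 0, 1, 1, 0, 0, 1],
    ]
    codeword = [1, 0, 1, 1, 0, 0, 1]
    received = [1, None, 1, 1, None, 0, 1]            # two erased bits
    print(erasure_decode(PC, received) == codeword)   # True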
Let us study a special type of sparse graph codes that are used as erasure codes, and which are decoded using iterative decoding.
9.3.1 Fountain codes

Fountain codes (also known as rateless erasure codes) are a class of erasure codes with the property that a potentially limitless sequence of encoding symbols can be generated from a given set of source symbols, in such a way that the original source symbols can ideally be recovered from any subset of the encoding symbols of size equal to, or only slightly larger than, the number of source symbols. The term fountain or rateless refers to the fact that these codes do not exhibit a fixed code rate.

A fountain code is optimal if the original $k$ source symbols can be recovered from any $k$ encoding symbols. Fountain codes are known that have efficient encoding and decoding algorithms, and that allow the recovery of the original $k$ source symbols from any $k'$ of the encoding symbols with high probability, where $k'$ is just slightly larger than $k$.

Fountain codes were created by Michael Luby at his company Digital Fountain, the first company whose business is based on sparse-graph codes (see the next lecture). Luby invented LT codes, a special type of fountain codes, in 1998.

An LT code works as follows:
Each encoded packet $t_n$ is produced from the source file $s_1 \cdots s_k$ as follows:

1. Find a check node $t_m$ that is connected to only one source packet $s_i$ (if there is no such check node, this decoding algorithm halts at this point, and fails to recover all the source packets).

(a) Set $s_i = t_m$.

(b) Add $s_i$ to all checks $t_r$ that are connected to $s_i$.

(c) Remove all the edges connected to the source packet $s_i$.
The above probabilistic strategy comes to an end, if and only if, the halting condition in step one is not reached before all the source packets are determined. The latter can be ensured, with high probability, if one chooses the right degree distribution: occasional encoded packets must have high degree (i.e., $d$ similar to $k$) in order to ensure that there are not some source packets that are connected to no one. On the other hand, many packets must have low degree, so that the decoding process can get started, and keep going, and so that the total number of addition operations involved in the encoding and decoding is kept small.
Prove that the expected degree under this distribution is equal to ln (k) :
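The distribution referred to here is, presumably, the ideal soliton distribution $\rho(1) = \frac{1}{k}$, $\rho(d) = \frac{1}{d(d-1)}$ for $d = 2, \ldots, k$ (its definition does not appear in this excerpt, so take that as an assumption of ours). The sketch below builds it and checks that its expected degree grows like $\ln(k)$.

import math

def ideal_soliton(k):
    """Ideal soliton distribution on degrees 1..k (assumed standard form):
    rho(1) = 1/k and rho(d) = 1 / (d * (d - 1)) for d = 2, ..., k."""
    rho = {1: 1 / k}
    for d in range(2, k + 1):
        rho[d] = 1 / (d * (d - 1))
    return rho

if __name__ == "__main__":
    k = 10000
    rho = ideal_soliton(k)
    print(abs(sum(rho.values()) - 1) < 1e-9)  # it is a probability distribution
    expected_degree = sum(d * p for d, p in rho.items())
    print(expected_degree, math.log(k))       # about 9.79 and 9.21: grows like ln(k)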
The ideal soliton distribution behaves almost as required, but it does not work in practice. The robust soliton distribution has two extra parameters, $c$ and $\delta$, and it is designed to ensure that the expected number of degree-one checks is about
$$S = c\,\ln\!\left(\frac{k}{\delta}\right)\sqrt{k}.$$
Luby's key result is that (for an appropriate value of the constant $c$) receiving $k' = k + 2\ln\!\left(\frac{S}{\delta}\right) S$ packets ensures that all packets can be recovered with probability at least $1 - \delta$.
Remark 113 Raptor codes are the first known class of fountain codes with linear time encoding and decoding. Raptor codes were invented by Amin Shokrollahi [3]. Raptor codes are formed by the concatenation of two codes. A fixed-rate erasure code, usually with a fairly high rate, is applied as an outer code. The inner code takes the result of the pre-coding operation and generates a sequence of encoding symbols. Each encoding symbol is the XOR of a pseudo-randomly chosen set of symbols from the pre-code output. The number of symbols which are XOR'ed together to form an output symbol is chosen pseudo-randomly for each output symbol according to a specific probability distribution. This distribution, as well as the mechanism for generating pseudo-random numbers for sampling this distribution and for choosing the symbols to be XOR'ed, must be known to both sender and receiver.
References
[1] I. Reed, G. Solomon. Polynomial Codes over Certain Finite Fields. Journal
of the Society for Industrial and Applied Mathematics, 8 (2): 300–304, 1960.
[2] J. Byers, M. Luby, M. Mitzenmacher, A. Rege. Digital fountain approach to reliable distribution of bulk data. ACM SIGCOMM Computer Communication Review 28(4): 1–11, 1998.
[3] A. Shokrollahi. Raptor Codes. IEEE Transactions on Information Theory, 52: 2551–2567, 2006.