Probability and Coding
J. Andres Montoya
May 21, 2020
1 Probabilistic Inference: Bayes Theorem
Let $X, Y_1, \ldots, Y_n$ be random variables that are jointly distributed according to a probability mass function $P(x, y_1, \ldots, y_n)$. Suppose one knows that the equalities $Y_1 = a_1, \ldots, Y_n = a_n$ hold, and suppose that he wants to say something about the possible value of $X$. The most likely value of $X$ is the element $a$ for which the quantity $\Pr[X = a \mid Y_1 = a_1, \ldots, Y_n = a_n]$ is maximized.
If one wants to use this code for transmission, then, given a message $(x_1, \ldots, x_4)$, he has to compute three parity-check bits, the bits $x_1 + x_2 + x_3$, $x_1 + x_2 + x_4$ and $x_1 + x_3 + x_4$, and construct the codeword
$$(x_1, \ldots, x_4,\; x_1 + x_2 + x_3,\; x_1 + x_2 + x_4,\; x_1 + x_3 + x_4).$$
This is the codeword that is transmitted using the insecure channel at hand.
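To make the encoding rule concrete, here is a minimal sketch in Python (the function name and the sample message are ours, not part of the text) that computes the three parity-check bits and assembles the 7-bit codeword described above.

def encode(message):
    """Encode a 4-bit message (x1, x2, x3, x4) into the 7-bit codeword
    (x1, x2, x3, x4, x1+x2+x3, x1+x2+x4, x1+x3+x4), with sums taken mod 2."""
    x1, x2, x3, x4 = message
    p1 = (x1 + x2 + x3) % 2
    p2 = (x1 + x2 + x4) % 2
    p3 = (x1 + x3 + x4) % 2
    return [x1, x2, x3, x4, p1, p2, p3]

if __name__ == "__main__":
    print(encode([1, 0, 1, 1]))  # one of the 16 codewords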
Suppose that this channel is a symmetric binary channel with noise $f \in (0, \tfrac{1}{2})$, that is, $f$ is the probability of a bit flip. Suppose also that Alice wants to transmit a message $w \in \{0,1\}^4$, and suppose that Bob is aware of Alice's desire.
Notice that, from the point of view of Bob, there are two random variables that matter: the variable $X$, whose value is Alice's choice of a message, and the variable $Y$, whose value is the string of length 7 that will be seen by Bob after transmission. Thus, transmission will completely reveal the value of $Y$, but it won't completely reveal the value of $X$, because of the noise. The best thing that Bob can do is to determine the most likely choice of Alice, given the known value of $Y$. Can it be done? Notice that the decoding question is a typical inference task that must be solved probabilistically.
$$\Pr(Y_i = y_i \mid H_i = h_i) = \begin{cases} f, & \text{if } y_i \neq h_i,\\ 1-f, & \text{otherwise.} \end{cases}$$
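For a whole block, the displayed transition probabilities give the likelihood $\Pr(Y = y \mid H = h) = f^{d}(1-f)^{N-d}$, where $d$ is the number of positions in which $y$ and $h$ differ. A small sketch of this computation (our own illustration, not taken from the text):

def likelihood(received, transmitted, f):
    """Probability of observing `received` given that `transmitted` was sent
    over a symmetric binary channel with bit-flip probability f:
    each flipped bit contributes a factor f, each preserved bit a factor 1 - f."""
    assert len(received) == len(transmitted)
    d = sum(r != t for r, t in zip(received, transmitted))  # Hamming distance
    n = len(received)
    return (f ** d) * ((1 - f) ** (n - d))

if __name__ == "__main__":
    sent = [1, 0, 1, 1, 0, 0, 1]
    seen = [1, 0, 0, 1, 0, 0, 1]          # one bit was flipped
    print(likelihood(seen, sent, 0.1))    # 0.1 * 0.9**6 = 0.0531...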
Theorem 3 (Bayes' Theorem)
Let $X, Y$ be two random variables that are jointly distributed according to a probability distribution $P(x, y)$; we have
$$\Pr(X = x \mid Y = y) = \frac{\Pr(Y = y \mid X = x)\,\Pr(X = x)}{\Pr(Y = y)}.$$
It is very easy to prove the above theorem (despite its strength). We leave the proof as an exercise for the reader.

Exercise 4 Prove Bayes' Theorem, and use it to compute the above backward probabilities.
Bayes' Theorem is named after Thomas Bayes (1701–1761), who studied how to compute a distribution for the probability parameter of a binomial distribution. Bayes' unpublished manuscript was significantly edited by Richard Price before it was posthumously read at the Royal Society [1]. Price edited Bayes' major work "An Essay towards solving a Problem in the Doctrine of Chances", which appeared in Philosophical Transactions, and contains Bayes' Theorem. Price wrote an introduction to the paper which provides some of the philosophical basis of Bayesian statistics. In 1765, he was elected a Fellow of the Royal Society in recognition of his work on the legacy of Bayes. By modern standards, we should refer to the Bayes–Price rule. Price discovered Bayes' work, recognized its importance, corrected it, contributed to the article, and found a use for it.
Suppose one proposes a model $H$ (hypothesis). If $H$ is a predictive model, it yields a method for the computation of the forward probabilities $P(E \mid H)$. The existence of a method for the computation of the backward probabilities is less clear. Bayes' Theorem ensures that if the former probability can be computed, then the latter probability can also be computed. Bayes' Update Rule (Bayes' Theorem) states that
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)},$$
and it is widely used in probabilistic inference to update the probability of a hypothesis as new information is collected. Let us consider an elementary example of this application.
Suppose there are two full bowls of cookies. Bowl 1 has 10 chocolate chip and 30 plain cookies, while bowl 2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl 1? Here we have a hypothesis: Fred chose bowl 1. And we also have data: the chosen cookie is plain. Now, we want to evaluate the probability of the hypothesis given the data. It seems clear that the answer should be more than a half, since there are more plain cookies in bowl 1. The precise answer is given by Bayes' theorem:
Let $H_i$ be the hypothesis: the chosen bowl is bowl number $i$. We assumed that the bowls are identical from Fred's point of view, which implies that
$$P(H_1) = P(H_2) = \frac{1}{2}.$$
Both hypotheses are predictive: they allow us to compute the forward probabilities (the probability of picking each of the two possible types of cookies). If $E$ is the event "the chosen cookie is plain", we get that
$$\Pr(E \mid H_1) = \frac{3}{4} \quad \text{and} \quad \Pr(E \mid H_2) = \frac{1}{2}.$$
Notice that $P(E)$ is a marginal probability: the probability of event $E$ given null information about the additional variables. This marginal probability can be hard to compute. However, this is not the case in our elementary example. We have that
$$P(E) = P(H_1)\Pr(E \mid H_1) + P(H_2)\Pr(E \mid H_2) = \frac{3}{8} + \frac{2}{8} = \frac{5}{8}.$$
We get from Bayes' rule that
$$\Pr(H_1 \mid E) = \frac{\Pr(E \mid H_1)\,P(H_1)}{P(E)} = \frac{3}{5}.$$
We can conclude that $H_1$ is more likely than $H_2$ (given the data), since
$$\Pr(H_2 \mid E) = \frac{2}{5} < \frac{3}{5} = \Pr(H_1 \mid E).$$
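The computation above is easy to reproduce. The following sketch (the variable names are ours) applies Bayes' rule to the two bowls and prints the posterior $3/5$.

# Priors: Fred picks each bowl with the same probability.
prior = {"bowl1": 0.5, "bowl2": 0.5}
# Forward probabilities: probability of drawing a plain cookie from each bowl.
likelihood_plain = {"bowl1": 30 / 40, "bowl2": 20 / 40}

# Marginal probability of the data (a plain cookie).
p_plain = sum(prior[h] * likelihood_plain[h] for h in prior)

# Posterior probabilities via Bayes' rule.
posterior = {h: prior[h] * likelihood_plain[h] / p_plain for h in prior}
print(posterior)  # {'bowl1': 0.6, 'bowl2': 0.4}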
1.3 Decoding

Let $v, w \in \mathbb{F}_q^N$. The Hamming distance between those two strings is equal to the number of positions at which they differ, that is, $d_H(v, w) = |\{i \leq N : v[i] \neq w[i]\}|$.
Suppose that $w$ is the vector received by Bob after transmission, and let $v$ be the codeword that minimizes the Hamming distance to $w$. The codeword $v$ is the most likely message. Thus, the above problem, or some relevant subproblem, is the problem that we have to solve if we want to decode the messages transmitted by Alice. We have to remark that this problem is NP-hard [3]. The latter means that we cannot efficiently process all the instances of this problem. We notice that this is not the end of the story, given that we do not want to decode all the possible codes; we want to decode the codes that matter. We have to ask: what are the codes that matter? The latter is a hard question which lacks a definitive answer; it is one of the foundational questions of Information Theory.
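For a code as small as the $[7,4]$ code above, the most likely codeword can be found by brute force: enumerate the 16 codewords and pick the one closest to the received word in Hamming distance. The sketch below is our own illustration (the encoder is the one from the previous section); it is exactly the kind of exhaustive search that stops being feasible for large codes.

from itertools import product

def encode(message):
    """The (7,4) parity-check encoder discussed above."""
    x1, x2, x3, x4 = message
    return [x1, x2, x3, x4,
            (x1 + x2 + x3) % 2, (x1 + x2 + x4) % 2, (x1 + x3 + x4) % 2]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def decode(received):
    """Return the codeword at minimum Hamming distance from `received`
    (the most likely transmitted word over a symmetric binary channel with f < 1/2)."""
    codewords = [encode(list(m)) for m in product([0, 1], repeat=4)]
    return min(codewords, key=lambda c: hamming(c, received))

if __name__ == "__main__":
    sent = encode([1, 0, 1, 1])
    corrupted = sent.copy()
    corrupted[2] ^= 1                 # flip one bit
    print(decode(corrupted) == sent)  # True: a single error is corrected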
References
[1] T. Bayes. An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London 53: 370–418, 1763.

[2] W. Kurt. Bayesian Statistics the Fun Way: Understanding Statistics and Probability with Star Wars, LEGO, and Rubber Ducks. No Starch Press, San Francisco, 2019.

[3] E. Berlekamp, R. McEliece, and H. van Tilborg. On the inherent intractability of certain coding problems. IEEE Trans. Inform. Theory, 24: 384–386, 1978.
2 Uncertainty and Entropy

In the previous lecture we entered the world of information theory. We will stay in this world for a while. We want to know which linear codes are worth decoding, and we want to know whether we can decode those codes efficiently. To answer the former question we will need the fundamental notion of entropy.
Definition 6 Let $X$ be a random variable distributed over the set $\{1, \ldots, n\}$ with probabilities $\{p_1, \ldots, p_n\}$. The Shannon information content of the event $i$ is equal to $\log \frac{1}{p_i}$.

We observe that the above definition makes sense: the information content of the event $i$ is directly proportional to the surprise that its occurrence could produce.
Definition 7 Let $X_1, \ldots, X_n$ be a tuple of random variables, which are jointly distributed over a finite set $A_1 \times \cdots \times A_n$. The (joint) entropy of the tuple $(X_1, \ldots, X_n)$ is equal to
$$H(X_1, \ldots, X_n) = -\sum_{(a_1, \ldots, a_n)} \Pr[X_1 = a_1, \ldots, X_n = a_n] \log\left(\Pr[X_1 = a_1, \ldots, X_n = a_n]\right).$$
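As a quick illustration of these definitions (a sketch of ours, not part of the text), the following snippet computes the Shannon information content of an outcome and the entropy of a distribution; for the joint entropy one applies the same function to the joint probability mass function.

import math

def information_content(p):
    """Shannon information content, in bits, of an event of probability p."""
    return math.log2(1 / p)

def entropy(probabilities):
    """Entropy (in bits) of a distribution given as a list of probabilities.
    Terms with probability 0 contribute nothing, by the convention 0 log(1/0) = 0."""
    return sum(p * information_content(p) for p in probabilities if p > 0)

if __name__ == "__main__":
    print(information_content(1 / 8))          # 3.0 bits
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits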
Definition 15 An injective function $c : \{1, \ldots, n\} \to \{0,1\}^*$ is a prefix-free code, if and only if, there do not exist $i \neq j \in \{1, \ldots, n\}$ such that $c(i)$ is a prefix of $c(j)$.

An optimal code for the distribution $\{p_1, \ldots, p_n\}$ is a code that minimizes the average codeword length $\sum_{i \leq n} p_i\,|c(i)|$; as we will see, this minimum is governed by $H(p_1, \ldots, p_n)$, where $H(p_1, \ldots, p_n)$ is the Shannon entropy of this distribution (random variable).
Proof. Let $L \geq l_1, \ldots, l_n$. There are exactly $2^L$ words of length $L$. Let us say that each of those words has weight $2^{-L}$. We have that $\sum_{|w| = L} \mathrm{weight}(w) = 1$. Let $w_i, w_j$ be two code-words of lengths $l_i$ and $l_j$. We have that those two words do not have common heirs, and we have that $w_i$ has $2^{L - l_i}$ heirs of length $L$. Then, we have
$$1 \geq \sum_{w \text{ is heir of a code-word}} \mathrm{weight}(w) = \sum_{i \leq n} 2^{L - l_i} \cdot 2^{-L} = \sum_{i \leq n} 2^{-l_i}.$$
Exercise 19 Let $\{w_1, \ldots, w_n\}$ be a binary prefix code; prove that the average length of this code is bounded below by $H(p_1, \ldots, p_n)$.

Theorem 20 There exists a prefix code for $\Sigma_n$ whose average length is bounded above by $H(p_1, \ldots, p_n) + 1$.

The proof of the above theorem allows us to give the term $\log\frac{1}{p_i}$ an important and useful interpretation: $\log\frac{1}{p_i}$ is equal to the number of bits that are used to encode $i$ in an optimal encoding of $\Sigma_n$. The entropy of this probability distribution is the average length of this optimal encoding.
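The usual way to prove Theorem 20 is to assign to symbol $i$ a codeword of length $\lceil \log \frac{1}{p_i} \rceil$: these lengths satisfy Kraft's inequality, so a prefix-free code with those lengths exists, and its average length lies below $H(p_1, \ldots, p_n) + 1$. The following sketch (ours, under the assumption that this is the construction the text has in mind) checks both facts numerically for an arbitrary example distribution.

import math

def shannon_lengths(probabilities):
    """Codeword lengths ceil(log2(1/p_i)) used in the proof of the upper bound."""
    return [math.ceil(math.log2(1 / p)) for p in probabilities]

def entropy(probabilities):
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

if __name__ == "__main__":
    dist = [0.45, 0.25, 0.15, 0.10, 0.05]
    lengths = shannon_lengths(dist)
    kraft = sum(2 ** (-l) for l in lengths)
    avg = sum(p * l for p, l in zip(dist, lengths))
    h = entropy(dist)
    print(lengths)             # [2, 2, 3, 4, 5]
    print(kraft <= 1)          # Kraft's inequality holds, so a prefix code exists
    print(h <= avg <= h + 1)   # the average length is between H and H + 1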
3 Data Compression and Shannon's Source Coding Theorem

Let us go back to Alice's Gran. Recall that she went to the post office in order to send a telegram. There, in the post office, she met the operator of the telegraph, a lovely young guy who wants to be as helpful as possible.

The operator observed that each character has a characteristic frequency within the text. Let us use the symbol $p_i$ to denote the frequency of character $i$. The set $\{p_1, \ldots, p_n\}$ is a probability distribution over $\Sigma_n$, and it is the probability distribution that matters for this message. The operator decides to try an efficient encoding of Gran's message. Given the empirical distribution $\{p_1, \ldots, p_n\}$, the operator can look for a prefix-free code of the alphabet $\Sigma_n$. It is for sure that he can find a prefix-free code $c$ such that the average length of $c$ is bounded above by $H(p_1, \ldots, p_n) + 1$. Let $w$ be the message written by Alice's Gran; if the operator uses $c$ he gets a codeword $c_w$ for which the inequality $|c_w| \leq (H(p_1, \ldots, p_n) + 1)\,|w|$ holds.
2. The above coding scheme does not work in real time. Coding with this scheme requires two passes over the string $w$: in the first pass one computes the distribution $\{p_1, \ldots, p_n\}$, and in the second pass he computes the codeword $c_w$. We ask: does there exist a real-time coding scheme achieving a similar upper bound?
Exercise 22 Prove that the compression ratio of the code $c$, discussed in the previous section, is bounded above by $H(p_1, \ldots, p_n) + 1$.
Exercise 23 Prove the first item.

We focus on the second item, which has an existential (constructive) character. We have to ask: what does it mean that the ratio $H(p_1, \ldots, p_n)$ can be asymptotically achieved?

We can imagine that Alice's Gran is not the author of a single message, but an information source with distribution $\{p_1, \ldots, p_n\}$. The latter means that: if Alice's Gran is in message mode, she chooses at random a length $m$ and a message $u \in \Sigma_n^m$. The string $u$ is chosen character by character: the $i$-th character of $u$ is chosen from $\Sigma_n$ according to the distribution $\{p_1, \ldots, p_n\}$.
Let $w$ be the original message written by Alice's Gran, and which motivated all this discussion. We have that $w$ is just one of the infinitely many messages that can be written by this lovely old woman. Thus, we can try to construct a suitable code for this information source and not only for $w$. If we do the latter we will have to code arbitrarily long strings, and hence it will become possible to asymptotically achieve Shannon's Limit: the information ratio achieved for long strings approaches Shannon's Limit.
Remark 24 We would like to remark that $w$ is a typical string for the above distribution: the empirical frequency of character $i$, which is defined as $\frac{|\{k \leq |w| : w[k] = i\}|}{|w|}$, is exactly equal to $p_i$.
We prove, in the next section, that it is possible to asymptotically achieve the ratio $H(p_1, \ldots, p_n)$ using block codes of increasing length.

Definition 25 A block code of length $N$ is a prefix-free embedding of $\Sigma_n^N$ into $\{0,1\}^*$.
Remark 26 Suppose $M > N$. One can use a block code of length $N$ to code strings of length $M$. Suppose $w \in \Sigma_n^M$; one can factor this string as a sequence of blocks of length $N$ plus a single factor (the rightmost factor) whose length is upper-bounded by $N$. Suppose that $w$ is factored as $u_1 \cdots u_k u_{k+1}$, and suppose that $|u_{k+1}| < N$. The codeword assigned to $w$ is the string
$$\hat{c}(w) = c(u_1) \cdots c(u_k)\, \tilde{c}(u_{k+1}),$$
where $\tilde{c}$ is a second code that is applied to the strings whose length is strictly less than $N$. We are not interested in the code $\tilde{c}$, and we will focus on the behavior of $c$.
Definition 27 Suppose we have an information source with distribution $\{p_1, \ldots, p_n\}$, and let $c : \Sigma_n^N \to \{0,1\}^*$ be a binary block code for this source.

1. We say that $c$ is a lossless code, if and only if, given $w \in \Sigma_n^N$, one can deterministically recover $w$ from its code $c(w)$.

2. Let $\varepsilon > 0$. We say that $c$ is $\varepsilon$-lossy, if and only if, the inequality
$$\Pr_{w \sim D_N}[\text{one can correctly compute } w \text{ from } c(w)] \geq 1 - \varepsilon$$
holds, where $D_N$ is the probability distribution over $\Sigma_n^N$ that is determined by the information source under consideration.
Let $S$ be an information source with probability distribution $\{p_1, \ldots, p_n\}$. Shannon's probabilistic argument allows us to show that for all $\varepsilon > 0$ there exists an $\varepsilon$-lossy code whose information ratio is bounded above by $H(p_1, \ldots, p_n) + \varepsilon$. How can we prove such a thing? A general strategy in problem solving consists in reducing a complex problem to a subproblem that is completely constituted by easy instances. Think of the solution of linear systems: we reduce the non-triangular instances to the triangular instances using Gaussian elimination. Thus, let us ask: for which information sources is it easy to construct a code achieving Shannon's Limit?
Exercise 28 Suppose that $n$ is a power of two and suppose that $\{p_1, \ldots, p_n\}$ is the uniform distribution. Fix an injection $c : \Sigma_n \to \{0,1\}^{\log(n)}$. Prove that the information ratio of the prefix-free code determined by $c$ is equal to $H(p_1, \ldots, p_n)$.

Exercise 29 Suppose that $n$ is not a power of two, but suppose that the probability distribution is still uniform. Show that Shannon's Limit is asymptotically achieved using lossless codes.
We have that Shannon’s Limit can be easily achieved in the uniform case.
Can we reduce the general case to the uniform one?
Let $D = \{p_1, \ldots, p_n\}$ be the probability distribution of our information source and let $H(D)$ be its Shannon entropy. Given $N$, we use the symbol $D_N$ to denote the probability distribution that is induced by $D$ over the set $\Sigma_n^N$. This latter probability distribution is defined by the equality
$$\Pr[w] = p_1^{N_1} \cdots p_n^{N_n},$$
where $N_i$ denotes the number of occurrences of character $i$ in $w$. If, for all $i \leq n$, the ratio $\frac{N_i}{N}$ is close to $p_i$, then this latter quantity is close to $2^{-N H(D)}$.
The above observation motivates the following precise definition of typical strings.

Definition 30 Let $\delta > 0$; the set of $\delta$-typical strings of length $N$ is the set
$$T_{N,\delta} = \left\{ w \in \Sigma_n^N : \left| \frac{1}{N} \log \frac{1}{\Pr[w]} - H(D) \right| < \delta \right\}.$$
Remark 31 Notice that the typical strings are the strings whose information content is close to the expected value $N H(D)$. The law of large numbers tells us that $D_N$ will concentrate, as $N$ grows, around those strings.
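Remark 31 can be checked empirically. The short simulation below is our own sketch (the source and the value of $\delta$ are arbitrary choices): it draws i.i.d. strings of length $N$ and estimates the probability that $\frac{1}{N}\log\frac{1}{\Pr[w]}$ falls within $\delta$ of $H(D)$; this probability approaches 1 as $N$ grows.

import math
import random

def entropy(dist):
    return sum(p * math.log2(1 / p) for p in dist.values())

def empirical_typicality(dist, N, delta, trials=2000, seed=0):
    """Fraction of sampled strings w of length N whose per-symbol information
    content |(1/N) log2(1/Pr[w]) - H(D)| is smaller than delta."""
    rng = random.Random(seed)
    symbols = list(dist)
    weights = [dist[s] for s in symbols]
    h = entropy(dist)
    hits = 0
    for _ in range(trials):
        w = rng.choices(symbols, weights=weights, k=N)
        info = sum(math.log2(1 / dist[s]) for s in w) / N
        if abs(info - h) < delta:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    D = {"a": 0.5, "b": 0.3, "c": 0.2}
    for N in (10, 100, 1000):
        print(N, empirical_typicality(D, N, delta=0.1))  # fraction grows towards 1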
Exercise 34 Choose a suitable statement for The Source Coding Theorem and prove it. Let us suggest for $\delta$ the value $\frac{1}{\sqrt[5]{N}}$.
3.2 A Practical Coding Scheme: Lempel–Ziv

We know that we can achieve better information ratios if we code long blocks instead of characters. We did the latter in the previous sections: we studied the existence of block codes that achieve better ratios than the ratios achieved by coding schemes based on the coding of characters. The block codes considered in the previous sections force us to take into account the probability distribution of the source. If we are given a single message, and we have to compute the empirical frequencies in this string, we cannot compute the code in real time. We ask: which information ratios can be achieved using real-time block coding? Or, equivalently: what information ratios can be achieved using block coding but ignoring the probability distribution of the source? If we ignore the probability distribution of the source, we are necessarily using a coding strategy that works for any source, and which can be considered as universal. We claim that being universal implies being either lossless or inefficient. Let us study a lossless universal scheme invented by Abraham Lempel and Jakob Ziv [1], and which is the algorithmic core of most of the commercial software that is used for data compression.
The Lempel–Ziv scheme parses the input string into a sequence of phrases
$$v_0 \mid v_1 \mid \cdots \mid v_{c(n)}.$$
Consider, for instance, the string
$$w = AABABCBABAABACBBABBACC.$$
The parsing starts with the shortest string on the left we have not seen before. Notice that the 0-th phrase is always the empty string $\varepsilon$. We continue with the shortest phrase on the left we have not seen before. We get
$$\varepsilon \mid A \mid ABABCBABAABACBBABBACC.$$
We continue with the shortest phrase on the left we have not seen before. We get
$$\varepsilon \mid A \mid AB \mid ABCBABAABACBBABBACC.$$
Continuing in this way we get
$$\varepsilon \mid A \mid AB \mid ABC \mid B \mid ABA \mid ABAC \mid BB \mid ABB \mid AC \mid C.$$
Now, we have to describe this parsing using the alphabet $\{0,1\}$. We begin by observing that all the phrases that occur in this parsing are different, and that each one of those phrases is equal to a previous phrase plus a character on the right (all but the 0-th phrase). Thus, given $i \geq 1$, there exists a single $j < i$ such that $v_i$ is equal to $v_j x_{k(i)}$, where $x_{k(i)}$ is a letter in $\Sigma_n$. Observe that $v_i$ can be fully described with two data: the binary code of $j$ and the binary code of $x_{k(i)}$.
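A direct implementation of this parsing rule is short. The sketch below (ours) reproduces the phrase list of the example above and, for each phrase $v_i$ with $i \geq 1$, the pair $(j, x)$ such that $v_i = v_j x$.

def lz_parse(w):
    """Parse w into distinct phrases, each equal to a previously seen phrase
    plus one extra character (LZ78-style parsing).  Returns the list of phrases
    (phrase 0 is the empty string) and, for every later phrase, the pair
    (index of the previous phrase, appended character)."""
    phrases = [""]                 # the 0-th phrase is the empty string
    index = {"": 0}                # phrase -> its index
    pairs = []
    current = ""
    for ch in w:
        if current + ch in index:  # still a phrase we have already seen
            current += ch
        else:                      # shortest new phrase: previous phrase + ch
            pairs.append((index[current], ch))
            index[current + ch] = len(phrases)
            phrases.append(current + ch)
            current = ""
    if current:                    # a possibly repeated final phrase is left over
        pairs.append((index[current[:-1]], current[-1]))
        phrases.append(current)
    return phrases, pairs

if __name__ == "__main__":
    phrases, pairs = lz_parse("AABABCBABAABACBBABBACC")
    print(phrases)  # ['', 'A', 'AB', 'ABC', 'B', 'ABA', 'ABAC', 'BB', 'ABB', 'AC', 'C']
    print(pairs)    # e.g. (2, 'C') encodes the phrase 'ABC' = 'AB' + 'C'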
Remark 36 We can always fix a binary coding of $\Sigma_n$: it is enough to fix a prefix-free code that injectively assigns to each letter a binary string of length $\lceil \log(|\Sigma_n|) \rceil$. If $n = 3$ and $\Sigma_3 = \{A, B, C\}$ we can fix the prefix-free code
$$\hat{A} = 00, \quad \hat{B} = 01 \quad \text{and} \quad \hat{C} = 11.$$
Exercise 40 Set
$$Q(w) = \prod_{i=1}^{m} p_{w[i]}$$
and set
$$c_l = |\{i \leq c(m) : |y_i| = l\}|.$$
We have that the inequality
$$\log Q(w) \leq -c(m)\log c(m) + O\!\left(c(m)\log\frac{m}{c(m)}\right)$$
holds.
Proof. We have
$$\sum_l c_l \log c_l = c(m)\log c(m) + c(m)\sum_l \frac{c_l}{c(m)} \log \frac{c_l}{c(m)}.$$
The last term is $c(m)$ times $\sum_l \frac{c_l}{c(m)}\log\frac{c_l}{c(m)}$, and $-\sum_l \frac{c_l}{c(m)}\log\frac{c_l}{c(m)}$ is the entropy of the distribution
$$\pi(l) = \frac{c_l}{c(m)}.$$
The expected value of this distribution is $\sum_l l\,\frac{c_l}{c(m)}$, and it is equal to $\frac{m}{c(m)}$; since a distribution on the positive integers with mean $\mu$ has entropy $O(\log \mu)$, the entropy of $\pi$ is $O\!\left(\log\frac{m}{c(m)}\right)$. We get that
$$\log Q(w) \leq -c(m)\log c(m) + O\!\left(c(m)\log\frac{m}{c(m)}\right),$$
and the lemma is proved.
Now we can prove that Lempel–Ziv is asymptotically efficient: if $w$ is a typical string of length $m$, then the inequality
$$|\widehat{w}| \leq m\,H(p_1, \ldots, p_n) + o(m)$$
holds, where $\widehat{w}$ denotes the Lempel–Ziv codeword of $w$.

We have that
$$Q(w) = \prod_{i \leq n} p_i^{m_i},$$
where $m_i$ denotes the number of occurrences of character $i$ in $w$, and by the law of large numbers we have that $m_i$ is approximately equal to $m p_i$. Thus, taking logs, we get that
$$\log Q(w) \approx m \sum_{i \leq n} p_i \log p_i = -m\,H(p_1, \ldots, p_n),$$
and hence, by the previous lemma,
$$m\,H(p_1, \ldots, p_n) + O\!\left(c(m)\log\frac{m}{c(m)}\right) \geq c(m)\log c(m).$$
References

[1] J. Ziv, A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24 (5): 530–541, 1978.
4 Communication Over Noisy Channels and Shannon's Channel Coding Theorem

Alice's Gran and the young operator agreed on a coding scheme. Let us suppose that they chose to use Lempel–Ziv. They used this scheme to compute a short binary codeword for the message written by Alice's Gran using the alphabet $\Sigma_n$. Now they want to transmit this binary codeword using the telegraphic channel. So far we have supposed that the binary channel to be used is a noiseless channel. This latter assumption is unlikely.

Remark 45 Take into account that, after compression, the information content of Alice's message is distributed over $w$ in a complex way. The latter implies that the receiver won't have the possibility of recovering the corrupted bits only by observing the adjacent characters and doing some inferences (as is usually done with intelligible but corrupted messages written in natural language).
telegraphic channel, and communicate to him those two codes. In the real world those two algorithms are carefully chosen by the administrator of the communication network, and unconsciously used by most users of the network.

Example 48 Let $X$ be the random variable that models the choices of Alice's Gran, and let $Y$ be the random variable that models the choice of the binary strings that are seen by the receiver. The pair $(X, Y)$ is a pair of random variables that are jointly and not independently distributed.
Let us consider three examples of noisy channels. The first one is The Typewriter Channel, the second one is the Symmetric Binary Channel (SB-Channel, for short), and the third one is The Erasure Channel.

The typewriter channel is defined by:

1. The source and the channel alphabet are both equal to $\{0, 1, \ldots, 26\}$.

2. The probability distribution $d$ is defined by
$$P(x \mid y) = \begin{cases} \frac{1}{3}, & \text{if the distance between } x \text{ and } y \text{ is less than } 2, \text{ or } \{x, y\} = \{0, 26\},\\ 0, & \text{otherwise.} \end{cases}$$
The symmetric binary channel is defined by:

1. The source and the channel alphabet are both equal to $\{0, 1\}$.

2. The probability distribution $d$ is symmetric, that is, there is a parameter $f$, called the noise, such that
$$P(0 \mid 1) = P(1 \mid 0) = f \quad \text{and} \quad P(0 \mid 0) = P(1 \mid 1) = 1 - f.$$

The erasure channel erases each transmitted symbol $a$ (replacing it with the symbol $?$) with probability $e$:
$$\Pr[a \text{ becomes equal to } ?] = e, \qquad \Pr[a \text{ does not become equal to } ?] = 1 - e.$$
Exercise 54 Consider the typewriter channel. Prove that the probability distribution of the source that maximizes the information conveyed by this channel is the uniform distribution. Use the latter to compute the capacity of the channel.
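The capacity claimed in the exercise can be checked numerically. The sketch below (ours; the representation of the channel is an arbitrary choice) builds the transition matrix of the noisy typewriter on the 27-letter circular alphabet and computes $I(X;Y) = H(Y) - H(Y \mid X)$ for the uniform input distribution; the result is $\log_2 9 \approx 3.17$ bits per channel use.

import math

A = 27  # circular alphabet {0, 1, ..., 26}

def transition(x, y):
    """Noisy typewriter: each input is received as itself or one of its two
    circular neighbours, each with probability 1/3."""
    return 1 / 3 if min((x - y) % A, (y - x) % A) <= 1 else 0.0

def mutual_information_uniform():
    px = [1 / A] * A
    # Output distribution p(y) = sum_x p(x) p(y|x).
    py = [sum(px[x] * transition(x, y) for x in range(A)) for y in range(A)]
    h_y = sum(-p * math.log2(p) for p in py if p > 0)
    # Conditional entropy H(Y|X): each input spreads over 3 equally likely outputs.
    h_y_given_x = sum(px[x] * sum(-transition(x, y) * math.log2(transition(x, y))
                                  for y in range(A) if transition(x, y) > 0)
                      for x in range(A))
    return h_y - h_y_given_x

if __name__ == "__main__":
    print(mutual_information_uniform())  # log2(9) = 3.1699...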
The Channel Coding Theorem states that the capacity $C$ is the supremum of the information ratios that can be reliably achieved for the channel. Let us state this theorem precisely.
Definition 55 An $N$-block code for the channel $(\Sigma, \Gamma, d)$ is a set $S \subseteq \Sigma^N$. The information ratio of $S$ is equal to
$$\frac{\log(|S|)}{N \log(|\Sigma|)}.$$

The Channel Coding Theorem asserts that, for every $\varepsilon > 0$ and every sufficiently large $N$, there exists an $N$-block code $S$ for which the inequality
$$\frac{\log(|S|)}{N \log(|\Sigma|)} > C - \varepsilon$$
holds, while the probability of block error is kept below $\varepsilon$. Moreover, if the information ratio $R$ is asymptotically achieved for the channel $(\Sigma, \Gamma, d)$ with a negligible probability of block error, then the inequality $R \leq C$ must hold.
We can easily confirm the theorem in the case of the typewriter channel. We can create a completely error-free communication strategy using a block code of length $N = 1$: we use only the letters B, E, H, \ldots, Z, i.e., every third letter. These letters form a non-confusable subset of the input alphabet. Thus, any output can be uniquely decoded. The number of inputs in the non-confusable subset is 9, so the error-free information rate of this system is
$$\frac{\log(9)}{\log(27)} = \frac{2}{3}.$$
4.3 Sketch of the Proof of Shannon's Theorem

Let $(\Sigma, \Gamma, d)$ be an arbitrary channel. We prove something weaker: we prove that the information ratio of any lossless code is bounded above by the capacity $I(X; Y)$. To prove the latter we consider the extended channel that corresponds to $N$ uses of the original channel. This extended channel has $|\Sigma|^N$ possible inputs and $|\Gamma|^N$ possible outputs. We make use of large block lengths. If $N$ is large, the extended channel looks a lot like the noisy typewriter: any particular input $w$ is very likely to produce an output within a small subspace of the output alphabet. So, we can try to find a non-confusable subset of the inputs that produce essentially disjoint output sequences.

Let $N$ be a large positive integer. The number of typical output sequences is $2^{N H(Y)}$, all having similar probability. Given a typical input sequence $w$, there are about $2^{N H(Y \mid X)}$ probable output sequences. Suppose that the channel is not very noisy. Then, the latter quantity becomes strictly smaller than the former, and the gap between those two quantities grows as $N$ grows. Then, we can imagine taking a large enough $N$ and restricting ourselves to a subset of the typical inputs such that the corresponding typical output sets do not overlap. The maximum number of non-overlapping subsets of size $2^{N H(Y \mid X)}$, included in a set of size $2^{N H(Y)}$, is bounded above by the ratio
$$\frac{2^{N H(Y)}}{2^{N H(Y \mid X)}} = 2^{N (H(Y) - H(Y \mid X))} = 2^{N I(X; Y)}.$$
5 Good Codes For The Symmetric Binary Channel

A linear block code of length $N$ and dimension $k$ is a linear embedding $c : \mathbb{F}^k \to \mathbb{F}^N$. The code $c$ can be suitably represented by a matrix $G_c$ that we call the generator matrix of the code, and which is nothing more than the matrix representation of the linear map $c$: the matrix $G_c$ is equal to
$$\begin{bmatrix} c(e_1) & \cdots & c(e_k) \end{bmatrix},$$
where $\{e_1, \ldots, e_k\}$ is the canonical basis of $\mathbb{F}^k$. Notice that the codewords are the linear combinations of the columns of $G_c$, and we can identify the code either with the linear map $c$ or with the range of $c$. By an abuse of notation we use the symbol $c$ to denote the latter vector subspace of $\mathbb{F}^N$. Notice that the vector space $c$ can also be described by a matrix whose rows (columns) constitute a basis of its orthogonal complement.
Remark 58 We define the orthogonal complement of $c$, the set $c^{\perp}$, as the set
$$\{v \in \mathbb{F}^N : \text{for all } w \in \mathrm{Range}(c) \text{ the equality } \langle v, w \rangle = 0 \text{ holds}\},$$
where $\langle v, w \rangle$ is defined as
$$v[1]\,w[1] + \cdots + v[N]\,w[N],$$
with the arithmetical operations computed in the ground field $\mathbb{F}_2$. We have that the set $c^{\perp}$ is a vector subspace of dimension $N - k$: the dimension of $c^{\perp}$ complements the dimension of $c$. However, it is no longer true that $c$ and $c^{\perp}$ have a trivial intersection: suppose that $N$ is an even integer and let $v \in \mathbb{F}_2^N$ be the all-ones vector; notice that the equality $\langle v, v \rangle = 0$ holds, and notice that it implies the equality $\langle v \rangle \cap \langle v \rangle^{\perp} = \langle v \rangle$. Thus, we should use the term orthogonal pseudo-complement of $c$ to refer to the set $c^{\perp}$. We will stick, by an abuse of language, with the term orthogonal complement to designate the set $c^{\perp}$.
Exercise 59 Prove that the equality
$$\dim\left(\mathrm{Range}(c)^{\perp}\right) = N - \dim(\mathrm{Range}(c))$$
actually holds.
Let $w_1, \ldots, w_{N-k}$ be a basis of $c^{\perp}$, and let
$$PC_c = \begin{bmatrix} w_1 \\ \vdots \\ w_{N-k} \end{bmatrix}$$
be the $(N-k) \times N$ matrix whose rows are those basis vectors. We have that $w \in \mathrm{Range}(c)$, if and only if, the equality
$$PC_c\, w = 0$$
holds. We say that $PC_c$ is a parity check matrix for $c$. We have that a linear block code $c$ can be completely described by a parity check matrix.
Suppose we are given the matrices $G_c$ and $PC_c$; it is quite straightforward to:

1. Encode a message $w \in \mathbb{F}^k$: the code of $w$ is equal to $G_c\, w$.

2. Recognize the codewords: $u$ is a codeword, if and only if, $PC_c\, u = 0$ (see the sketch below).
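As a sketch of these two operations (ours; we reuse the $[7,4]$ parity-check code from the first lecture, writing down its generator and parity-check matrices explicitly), encoding is a matrix–vector product over $\mathbb{F}_2$ and recognition is the computation of a syndrome.

# Generator matrix G_c (7 x 4): column i is the codeword of the i-th basis vector.
G = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],   # parity bit x1 + x2 + x3
    [1, 1, 0, 1],   # parity bit x1 + x2 + x4
    [1, 0, 1, 1],   # parity bit x1 + x3 + x4
]

# Parity-check matrix PC_c ((N - k) x N = 3 x 7): its rows span the dual code.
PC = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def matvec_gf2(M, v):
    """Matrix-vector product over the field F_2."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def encode(w):
    """Encoding: the codeword of w is G_c w."""
    return matvec_gf2(G, w)

def is_codeword(u):
    """Recognition: u is a codeword iff its syndrome PC_c u is the zero vector."""
    return all(s == 0 for s in matvec_gf2(PC, u))

if __name__ == "__main__":
    u = encode([1, 0, 1, 1])
    print(u, is_codeword(u))        # a valid codeword
    u[0] ^= 1
    print(u, is_codeword(u))        # one flipped bit violates some parity check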
Decoding is not so easy. It is known that the corresponding decoding problem is NP-hard, and it implies that optimal decoding of binary linear codes is not feasible.

Remark 61 If we fix a linear block code $c$ (a matrix $H$) we get a problem with a finite number of instances. Such a problem cannot be NP-hard; such a problem can be solved in time $O(1)$. Then, what is the problem?
First of all, we have to say that we cannot simply fix a code like The Hamming Code; we are obliged to aspire to greater and greater code lengths. Suppose we are stubborn and we insist on a specific code length. This code length has to be very large in order to be useful. Then, we get a finite problem that is constituted by a finite but large number of hard instances of an NP-hard problem. We get almost the same conclusion: the maximum likelihood decoding of binary codes is not practical for any useful size.
There are sub-optimal techniques, based on message passing (see below), which give excellent results and can be practically implemented. We will study those decoding techniques in a future lecture. Let us focus for a while on the following question: are there good linear codes for the symmetric binary channel? If we want to cope with the latter question, we will have to give an answer to the following question: what is a good code?

Let us try an answer to the above question. A good code for a symmetric binary channel $SB(f)$ is a code $c$ such that:

1. It is easy to encode messages using this code.

2. The information ratio of the code approaches Shannon's limit.

3. The code has good error correction capabilities (for the noise parameter $f$).

4. Practical decoding is possible.

We know that the first property holds for any linear code. It remains to investigate whether there exist linear codes for which the three additional properties also hold.
5.1 Reliability and Minimum Distance

Let $f < \frac{1}{2}$. Shannon's Channel Coding Theorem tells us that we can reliably communicate over the channel $SBC(f)$ at any transmission rate that is below the capacity of the channel, which is equal to $1 - H_2(f)$ (here the symbol $H_2(f)$ denotes the binary entropy of $f$). Then, if we want to prove that there exist good codes for the symmetric binary channel, we will have to prove that, given $f < \frac{1}{2}$ and given $\varepsilon > 0$, there exists a code whose information ratio is greater than $1 - H_2(f) - \varepsilon$ and which can be used for reliable communication over the channel $SBC(f)$. Let us ask: how can we ensure that a code is reliable?

Definition 62 Let $c \subseteq \mathbb{F}_2^N$ be a code; the minimum distance of $c$ is equal to $\min\{d_H(v, w) : v, w \in c \text{ and } v \neq w\}$.
Suppose that Alice uses $SBC(f)$ to transmit $N$ bits. Let $K$ be the number of bits incorrectly transmitted. The expected number of those bits is equal to $f N$. However, if $N$ is small, the ratio $\frac{K}{f N}$ could be far away from 1 with high probability. The law of large numbers ensures that the latter does not occur when we make $N$ go to infinity. Thus, suppose that $N$ is large, let $w \in \{0,1\}^N$ be the transmitted message and let $u$ be the received message. The Hamming distance $d_H(w, u)$ is equal to the number of errors that occurred along the transmission. Let $\varepsilon > 0$. We have that $d_H(w, u)$ is less than $(f + \varepsilon) N$ with high probability, and as $N$ goes to infinity this probability goes to 1. Now suppose that we are using a block code $c$ of length $N$, and suppose that the minimum distance of $c$ is greater than $(2f + \varepsilon) N$ for some $\varepsilon > 0$. We get that:

Let $c$ be a code of length $N$ and minimum distance greater than $2 f N$. If $N$ is large, the code can be used for reliable communication over the channel $SBC(f)$.
Notice that our quest for good codes reduces to showing that, given $f < \frac{1}{2}$, given a positive integer $N$ and given $\varepsilon > 0$, there exists a code of length $N$ whose minimum distance is greater than $2 f N$ and whose information ratio is greater than $1 - H_2(f) - \varepsilon$.

We show, in the last lecture of this booklet, that the latter can be done for small values of $f$. We show that a random construction works: if we choose the code at random, from a suitable ensemble, we get a good code with high probability.
LDPC (low-density parity-check) codes have two remarkable properties:

1. LDPC codes are capacity-approaching codes, which means that there exist practical constructions of LDPC codes that allow the noise threshold to be set very close to the theoretical maximum (the Shannon limit).

2. LDPC codes can be probabilistically decoded in linear time using message passing.

We will study LDPC codes for a while. We focus, in the remainder of the chapter, on the first of the above items.
LDPC codes are also known as Gallager codes, in honor of Robert G. Gallager, who developed the LDPC concept in his doctoral dissertation at the Massachusetts Institute of Technology in 1960 [1]. LDPC codes were forgotten until Gallager's work was rediscovered in 1996 by MacKay and Neal [2]. LDPC codes became, after the work of MacKay, the coding scheme of choice.
Let $c \subseteq \mathbb{F}_2^N$ be a linear code, and let $PC_c$ be a parity check matrix for this code. We can think of $PC_c$ as the adjacency matrix of a bipartite graph constituted by two sets of nodes: a set of bit nodes representing the transmitted bits, and a set of check nodes representing the constraints that the transmitted bits have to satisfy. Let $TG_c$ be this bipartite graph, where the adjacency relation is defined by: the $j$-th bit node is adjacent to the $i$-th check node if and only if $PC_c[i, j] = 1$.
If $PC_c$ is a sparse matrix, with few ones per column and few ones per row, then the graph $TG_c$ is a sparse graph. We say in the latter case that $c$ is a sparse graph code. Coding theory is based, nowadays, on sparse graph codes achieving performance close to the Shannon limit. The archetypal sparse graph codes are Gallager's low-density parity-check codes.
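To make the correspondence concrete, here is a small sketch (ours; the particular matrix is an arbitrary toy example, not a code from the text) that takes a sparse parity-check matrix and lists the resulting bipartite graph as an adjacency structure between check nodes and bit nodes.

def tanner_graph(pc_matrix):
    """Return, for each check node, the list of adjacent bit nodes:
    check i is adjacent to bit j exactly when PC[i][j] = 1."""
    return {f"check_{i}": [f"bit_{j}" for j, entry in enumerate(row) if entry == 1]
            for i, row in enumerate(pc_matrix)}

if __name__ == "__main__":
    # A toy sparse parity-check matrix: three ones per row, two ones per column.
    PC = [
        [1, 1, 0, 1, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 0, 1],
    ]
    for check, bits in tanner_graph(PC).items():
        print(check, "--", bits)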
Remark 66 The graph $TG_c$ is called the factor graph of $c$. Factor graphs constitute one of the many types of Tanner graphs that are related to codes. Tanner graphs are named after Michael Tanner, who introduced them as a means to create larger error correcting codes from smaller ones [3].
Proof. Suppose that $S$ has $u$ unique neighbors, and let $D = t\,|S|$ be the number of edges emanating from $S$. We have
$$t\,|S| \geq u + (|\Gamma(S)| - u)\cdot 2,$$
which gives $u \geq 2\,|\Gamma(S)| - D > \frac{D}{2}$.
Lemma 69 The minimum distance of $c_G$ is at least $\alpha N$.

Proof. We show that each codeword has at least $\alpha N$ ones. Suppose that this is not the case. Let $w \in c_G$ be a string with fewer than $\alpha N$ ones. Let
$$I_w = \{i \leq N : w[i] = 1\}.$$
We have that $I_w$ has more than $\frac{D}{2}$ unique neighbors, where $D = t\,|I_w|$. We get that at least $\frac{D}{2}$ parity checks are not satisfied.
$$\frac{N - R}{N} = 1 - 8e^2\,\alpha\,\log\frac{1}{\alpha}.$$
If $\alpha$ is small, the term on the right approaches the quantity $1 - H_2(f)$. We get that the existence of arbitrarily large $(t, \alpha, \delta)$-expanders ensures the existence of good codes for $SB(f)$ (when $f$ is small). It remains to prove that given those parameter values there exist $(t, \alpha, \delta)$-expanders that are as large as desired.
Exercise 71 Suppose that $f \leq \frac{1}{16}$. Set $\alpha = 2f$, $t = \log\frac{1}{\alpha}$ and $R = 8e^2\,t\,\alpha\,N$. Pick a random $t$-regular bipartite graph with $N$ bit nodes and $R$ parity check nodes by connecting each bit node to $t$ random neighbors. Prove that the probability that the constructed graph is a $(t, \alpha, \delta)$-expander is bounded below by $\frac{1}{2}$.
References
[1] R. G. Gallager. Low Density Parity Check Codes. Number 21 in MIT Research Monograph Series. MIT Press, 1963. Available from www.inference.phy.cam.ac.uk/mackay/gallager/papers/.
[2] D. MacKay, R. Neal. Near Shannon Limit Performance of Low Density Par-
ity Check Codes. Electronics Letters, July 1996.
[3] M. Tanner. A recursive approach to low complexity codes. IEEE Transactions on Information Theory 27(5): 533–547, 1981.
6 Constrained Noiseless Channels And Trellises

In this lecture we study a completely different type of channel, the so-called Constrained Noiseless Channels. Those channels will lead us to the important notion of trellis codes, which will be instrumental in the decoding of good codes for the symmetric binary channel.

We can begin with a constant length code that maps each source bit into two transmitted bits. We can, for instance, map 0 to 00 and 1 to 10. This is a code of ratio $\frac{1}{2}$, and it respects the constraints of the channel. We get that the capacity of the channel is at least 0.5. Can we do better?
The above code is redundant. We know that the second bit will always be a zero. We can achieve a smaller average transmitted length using a code that omits the redundant zeroes. We can try a variable length code that maps 0 to 0 and 1 to 10. If the source symbols are used with equal frequency then the average transmitted length per source bit is $\frac{3}{2}$. Can we do better?
A second, more fundamental approach counts how many valid sequences of
length N there are. Let SN be the latter quantity. We can communicate log (SN )
bits in N channel cycles by giving one name to each of these valid sequences.
This approach allows us to achieve the capacity of the channel. Then, if we want
to estimate this capacity we should consider the following counting problem:
Problem 73 #CNS
The above problem leads us, in turn, to the problem of counting paths over trellises.
The set of nodes of $T(M, q_0, n)$ is the set $Q_M \times \{0, \ldots, n\}$.

The pair $((p, i), (q, j))$ is a directed edge of $T(M, q_0, n)$ labeled with the letter $a$, if and only if, $j = i + 1$ and $M(p, a) = q$.

Let $L(\tau)$ be the set of sequences that are valid for $\tau$. We have that $L(\tau)$ is equal to $L(M)$, and we have that this latter set is in bijective correspondence with the set of $n$-length paths included in $T(M, q_0, n)$ that start at $(q_0, 0)$. Thus, we naturally arrive at the following counting problem:
1. If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.

2. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.

3. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.
The commander can compute the global number of soldiers by simply adding
together the number said to him by the
soldier in front, and the number said to him by the soldier behind. This
clever trick makes use of a profound property of the total number of soldiers:
this quantity can be written as the sum of the number of soldiers in front of a
point, and the number behind that point, two quantities which can be computed
separately, because the two groups are separated by the point.
If the soldiers were not arranged in a line but were travelling in a swarm,
then it would not be easy to separate them into two groups in this way. If the
connection graph of this swarm contains cycles, it is not possible for an agent
in a cycle to separate the group into two groups, ‘those in front’, and ‘those
behind’. A swarm of agents can be counted by a modified message-passing algorithm provided that the agents are arranged in a graph that contains no cycles.
6.4.2 Viterbi (Min-Sum) Algorithm

The algorithm that we study in this section is Viterbi's Min-Sum Algorithm [2]. Suppose we are given a weighted digraph $(G, w)$, and suppose that the weight assigned to each edge corresponds to the cost of crossing it. Suppose also that we are interested in computing cheapest paths in $G$, that is, suppose that we are interested in efficiently solving the following problem.

and transmit this message along all the edges that go out from $v$.

Exercise 85 Prove that the above tropical algorithm allows us to compute minimum distances in weighted digraphs.
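A minimal version of the min-sum recursion is easy to write down for a trellis-shaped digraph: process the time slices from left to right and keep, for every node, the cost of the cheapest path reaching it. The sketch below (ours; the edge-list format is an arbitrary choice) computes those optimal costs.

def viterbi_min_sum(num_stages, edges, start):
    """Cheapest-path costs in a trellis.

    `edges[i]` is a list of triples (p, q, cost): an edge from node p in slice i
    to node q in slice i + 1.  Returns, for every slice, a dictionary mapping
    each reachable node to the cost of the cheapest path from `start`."""
    best = [{start: 0.0}]                      # slice 0: only the start node
    for i in range(num_stages):
        current = {}
        for p, q, cost in edges[i]:
            if p in best[i]:                   # p is reachable in slice i
                candidate = best[i][p] + cost  # extend the cheapest path to p
                if q not in current or candidate < current[q]:
                    current[q] = candidate
        best.append(current)
    return best

if __name__ == "__main__":
    # A toy 3-stage trellis with two states per slice.
    edges = [
        [("s", "a", 1.0), ("s", "b", 2.0)],
        [("a", "a", 2.0), ("a", "b", 0.5), ("b", "a", 1.0), ("b", "b", 3.0)],
        [("a", "t", 1.0), ("b", "t", 4.0)],
    ]
    print(viterbi_min_sum(3, edges, "s"))
    # the final slice gives the cost of the cheapest s -> t path (here 4.0)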
Show that the integral image $I(x, y)$ can be efficiently computed by message passing. Show that, from the integral image, some simple functions of the image can be obtained. For example, give an expression for the sum of the image intensities $f(x, y)$ for all $(x, y)$ in a rectangular region extending from $(x_1, y_1)$ to $(x_2, y_2)$.
References
[1] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Proceedings of the Second National Conference on Artificial Intelligence (AAAI), pages 133–136, 1982.
[2] A. Viterbi. Error bounds for convolutional codes and an asymptotically op-
timum decoding algorithm. IEEE Transactions on Information Theory 13 (2):
260–269, 1967.
7 Decoding of Trellis Codes

Suppose that Alice and Bob are communicating over a noisy channel $SB(f)$, and suppose that they agreed on a binary $N$-block code $c$. Let $S \subseteq \{0,1\}^N$ be the set of codewords. Suppose Alice transmits a string $w \in S$, and let $\widehat{w}$ be the received string. Bob has to solve the following task: to compute the codeword $u$ that maximizes the quantity

$\Pr[ec(u \mid \widehat{w})]$ = the probability that $u$ was the transmitted string, given that $\widehat{w}$ is the string that we (Bob) received;

$\Pr[rc(\widehat{w} \mid u)]$ = the probability that $\widehat{w}$ is the message that Bob will receive, given that $u$ is the string that we (Alice) chose to transmit.

If the code $c$ is a good code, then the set of codewords is a large set, and it becomes unfeasible to compute the latter sum. Fortunately, we are not forced to compute the above sum, and the corresponding probability, given that we are assuming that $\widehat{w}$ is the input of our problem, and we are looking for the $u$ that fits $\widehat{w}$ the best. Thus, we focus on the numerator $\Pr(rc(\widehat{w} \mid u))\,\Pr[u]$. If we assume that Alice chose the message uniformly at random from $S$, we get that for all $u \in S$ the equality $\Pr[u] = \frac{1}{|S|}$ holds. Then, under the latter assumption, we can dispense with the factor $\Pr[u]$, and our problem reduces to the following one:
Notice that we can solve the above problem using a brute force algorithm: we compute $\Pr(rc(\widehat{w} \mid u))$ for all $u$ in $S$, and then we choose the $u$ that maximizes the latter quantity. However, this approach is unfeasible for most codes. We ask: can we do better than brute force? The bad news is that it is not always possible to do better, given that the decoding problem is NP-hard. The good news is that it is possible to do a good job for a large family of codes.
Suppose we are given as input the trellis $T$ and the string $\widehat{w}$. We can assign to the edge $((p, i), (q, i+1))$ the cost
$$-\log\left(\Pr[r(\widehat{w}[i] \mid t)]\right),$$
where $t$ is the label of this edge and $\Pr[r(\widehat{w}[i] \mid t)]$ denotes the probability of receiving bit $\widehat{w}[i]$ after transmitting bit $t$.

Exercise 92 Why did we assign weight $-\log(\Pr[r(\widehat{w}[i] \mid t)])$ to the edge $((p, i), (q, i+1))$ instead of the simpler weight $\Pr[r(\widehat{w}[i] \mid t)]$?

After assigning the above weights to the edges in $T$ we get a weighted trellis that we denote with the symbol $T_{\widehat{w}}$.
Exercise 93 Show that the output of Viterbi's algorithm, on input $T_{\widehat{w}}$, corresponds to the weight of the path in $T_{\widehat{w}}$ that represents the solution of The MAP Codeword Decoding Problem for the input $(c_T, \widehat{w})$.

Exercise 94 Viterbi's algorithm, as defined in the previous section, allows us to compute optimal weights. Now we want to compute optimal weights as well as the paths achieving those weights. Your first task is to design a variant of Viterbi's algorithm for the computation of optimal paths in weighted trellises. Your second task is to use the previous algorithm in the design of a decoding algorithm for trellis codes.
8 Trellis Representations of Linear Codes

Gallager proved that good LDPC codes can be constructed at random. This work was neglected for a long time because he did not provide decoding algorithms for those codes. The existence of efficient decoding algorithms for those codes is related to the following fact:

If we can efficiently construct trellis representations of LDPC codes, then we can use those representations and Viterbi's algorithm for efficient decoding.

In this lecture we study the existence and construction of trellis representations of linear block codes. A trellis is a digraph
$$(V_0 \sqcup V_1 \sqcup \cdots \sqcup V_N,\; E)$$
that satisfies the following (sequential) condition: for all $(v, w) \in E$ there exists $i < N$ such that $v \in V_i$ and $w \in V_{i+1}$. We say that $N$ is the length of the trellis, and we think of the interval $\{0, 1, \ldots, N\}$ as the time axis of the trellis.

Given a set $S$ of strings of length $N$, one can build a naive trellis $(G_S, L)$ whose $i$-th slice is the set
$$\{(w, i) : w \in S\}.$$
The trellis $(G_S, L)$ appears when one naively represents each string of $S$ as a path. Naive representations tend to be useless, and this is the case with the trellis $(G_S, L)$. The problem with this trellis is that it corresponds to a dilated representation of the code: we have that $|G_S| = N\,|S|$. Then, if $S$ is large, the trellis $(G_S, L)$ is even larger. The trellis representations that can be of use are the concise representations. What does concise mean?

On one hand we have that any linear code of length $N$ can be represented by a matrix of size $O(N^2)$ (linear codes have algebraic representations of polynomial (quadratic) size). And, on the other hand, we have that representations of superpolynomial size yield algorithms that run in superpolynomial time. We get a conclusion: a concise representation of a block code of length $N$ is a representation whose size is polynomial in $N$. We ask: let $c : \mathbb{F}_2^k \to \mathbb{F}_2^N$ be a linear code, can we efficiently construct a trellis representation of $c$ of size $O(N^d)$, for some fixed $d$ that does not depend on the code?
1. $c_P(0) = c_S(0) = \{0\}$ and $c_F(0) = c$.

2. $c_P(N) = \{v \in c : v[N] = 0\}$, $c_S(N) = \mathbb{F}_2$ and $c_F(N) = \{0\}$.

3. Each codeword $w$ determines a sequence

$$w_S^{i+1} = u_S^{i+1}$$

must hold.
We get, from the above exercise, that we can construct a trellis representation of $c$ using the vector spaces $c_S(i)$ as fibers.

Exercise 100 Prove that there exists a bijective correspondence between the set of codewords and the set of traverses of the above trellis.

We use the symbol $T(c)$ to denote the above trellis. Notice that $T(c)$ is completely determined by the linear structure of the code $c$. Is this representation small? Notice that the $i$-th fiber has size $2^{\dim(c_S(i))}$, which suggests that $T(c)$ could be a large trellis, with fibers of superpolynomial size. We have:

Theorem 101 Let $M$ be a trellis representation of $c$, and let $M(i)$ be its $i$-th fiber; the inequality $|M(i)| \geq 2^{\dim(c_S(i))}$ holds.

The latter assertion is called The State Space Theorem (see [1]), and it implies that $T(c)$ is a minimal trellis representation of the linear block code $c$. We conclude that if the code $c$ has a small trellis representation, then $T(c)$ is small.
8.3 Construction of Minimal Trellises

We know, from the previous section, that $T(c)$ is the minimal trellis representation of $c$. Can we construct this representation efficiently?

Let $i \leq N$. One can use basic linear algebra to compute the decomposition $c_P(i) \oplus c_F(i) \oplus c_S(i)$ (to compute a basis of each one of the vector spaces $c_P(i)$, $c_F(i)$ and $c_S(i)$). The latter means that one can efficiently compute the fibers, even if those fibers are large. Then, the only thing that remains to be done is to compute the transitions. We can solve the latter task using a brute force approach: list the codewords, and represent each codeword as a labeled path over the trellis $T(c)$. We know that brute force solutions tend to be unfeasible, and this is the case with the above solution: if $c$ is a good code, its dimension $k$ is large, and the set of codewords is a large set of size $2^k$. Thus, we have to try a different approach. Observe that computing the transitions reduces to deciding, for each possible triple $((x, i), (y, i+1), a)$, whether this triple is a transition or not. If the fibers are small, the number of those triples is also small, and the work can be done triple by triple. The latter means that we can focus on the following problem:
Let $c$ be a linear block code of length $N$, let $i \leq N$, let $x \in c_S(i)$, let $y \in c_S(i+1)$ and let $a \in \{0,1\}$. Decide if there exists $w \in c$ such that $w[i+1] = a$ and such that $w_S^i = x$ and $w_S^{i+1} = y$.

Let $(x, y, i, a)$ be an instance of the above problem; deciding whether the latter instance is a positive instance corresponds to deciding whether there exists $w \in c$ such that:

1. $\exists u \in c_P(i)\ \exists v \in c_F(i)\ (u + v + x = w)$.

2. $\exists r \in c_P(i+1)\ \exists s \in c_F(i+1)\ (r + s + y = w)$.

3. $w[i+1] = a$.

We observe that the latter question can be efficiently decided using elementary tools of linear algebra.

We conclude that if $\dim(c_S(i))$ is small for all $i$, then the trellis $T(c)$ is a small trellis that can be efficiently computed from the algebraic representation of the code $c$.
8.4 Altogether

If we put all the pieces together, we get a communication protocol for binary symmetric channels:

1. Alice and Bob agree on a Gallager code $c$ that is good for the channel at hand: the rate is close to the capacity of the channel, and the error probability (or the minimum distance) is small enough for Alice and Bob.

2. Alice uses a generator matrix of the code $c$ for encoding.

3. Bob computes the trellis $T(c)$, and uses this trellis together with Viterbi's Algorithm for decoding.

This very schematic communication protocol is, to some extent, the communication protocol of common use. The reader should notice, at this point, that we can solve, algorithmically, each one of the above three tasks.
References
[1] A. Vardy. Trellis structure of codes. In V. S. Pless and W. C. Huffman, Eds., Handbook of Coding Theory. Amsterdam, Elsevier Science, pp. 1989–2118, 1998.
9 Good Codes for the Erasure Channel
In this lecture, the last lecture of this booklet, we study the existence of good
codes for the erasure channel.
$$p_w(X) = a_1 + a_2 X + \cdots + a_k X^{k-1}.$$
Let $c_1, \ldots, c_n$ be $n$ different elements of a large enough finite field $\mathbb{F}$, and let $p_w(X)$ be the above polynomial. We can evaluate $p_w(X)$ at the points $c_1, \ldots, c_n$. If we do the latter we get a new codeword, the vector $(p_w(c_1), \ldots, p_w(c_n))$. If $n > k$, the latter vector is resilient: this vector can be fully recovered from any subvector of dimension $k$. The latter means that the information content of each entry is distributed over all the $k$-dimensional subvectors, and this makes it hard to lose this information. This fact allows us to think of the following coding strategy:

Let $w = (w_1, \ldots, w_k)$ be the message to be stored, choose $k + d$ different elements of $\mathbb{F}$, say the elements $c_1, \ldots, c_{k+d}$, and compute the vector
$$T^{k+d}(w) = (p_w(c_1), \ldots, p_w(c_{k+d})).$$

If one stores the vector $T^{k+d}(w)$, and there occur at most $d$ deletions, he will be able to reconstruct the polynomial $p_w(X)$ and the original message $w$ using the non-deleted entries.
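The recovery step is plain polynomial interpolation over a finite field. The following sketch (ours; it works over the prime field $\mathbb{F}_p$ with $p = 257$, an arbitrary choice, and uses Lagrange interpolation rather than anything optimized) evaluates $p_w$ at $k + d$ points, erases some of the values, and reconstructs the message from the survivors.

P = 257  # a prime; arithmetic is done in the field F_P

def poly_eval(coeffs, x):
    """Evaluate a1 + a2*X + ... + ak*X^(k-1) at x, modulo P (Horner's rule)."""
    result = 0
    for a in reversed(coeffs):
        result = (result * x + a) % P
    return result

def interpolate(points, k):
    """Lagrange interpolation: recover the k coefficients of a polynomial of
    degree < k from k evaluation points (x_j, y_j) over F_P."""
    coeffs = [0] * k
    for j, (xj, yj) in enumerate(points):
        # Build the j-th Lagrange basis polynomial as a coefficient vector.
        basis = [1]
        denom = 1
        for m, (xm, _) in enumerate(points):
            if m == j:
                continue
            # Multiply the basis polynomial by (X - xm).
            basis = [(-xm * basis[0]) % P] + [
                (basis[i - 1] - xm * basis[i]) % P for i in range(1, len(basis))
            ] + [basis[-1]]
            denom = (denom * (xj - xm)) % P
        scale = (yj * pow(denom, P - 2, P)) % P   # divide by denom via Fermat
        for i in range(k):
            coeffs[i] = (coeffs[i] + scale * basis[i]) % P
    return coeffs

if __name__ == "__main__":
    message = [7, 3, 0, 11]                          # k = 4 field elements
    k, d = len(message), 3
    xs = list(range(1, k + d + 1))                   # k + d distinct points
    stored = [(x, poly_eval(message, x)) for x in xs]
    survivors = stored[:2] + stored[5:]              # 3 entries were erased
    print(interpolate(survivors[:k], k) == message)  # True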
A code like $T^{k+d}$ is a Reed–Solomon Code [1]. Those codes are the algebraic codes that are constructed by representing messages as polynomials, and by evaluating those polynomials at large subsets of the ground field.

Singleton's inequality implies that codes satisfying the equality $k + d = m$ are optimal (maximum distance separable codes). We get that Reed–Solomon codes are optimal. However, in spite of their optimality, those codes are unpractical:

Exercise 109 Prove that the coding complexity and the decoding complexity of Reed–Solomon codes is $O(n \log^2(n))$ (Hint: you have to use the Fast Fourier Transform).

It can be proved that the coding complexity and the decoding complexity of Reed–Solomon codes is $\Theta(n \log^2(n))$. If $n$ is large, the factor $\log^2(n)$ is a serious drawback: take into account that there exist optimal erasure codes whose coding and decoding complexities are linear in $n$ (see below). Moreover,
with a Reed-Solomon code, as with any block code, one must estimate the
erasure probability e and choose the code rate before transmission. If we are
unlucky and e is larger than expected, and the receiver receives fewer than k
symbols, the message is completely lost.
Remark 110 Tornado codes are a special type of fixed-rate erasure codes which require more redundancy than Reed–Solomon codes, but which are much faster to generate. Software-based implementations of Tornado codes are about 100 times faster on small lengths and about 10,000 times faster on larger lengths than Reed–Solomon erasure codes [2]. Since the introduction of Tornado codes, many other similar erasure codes have emerged, most notably Online codes (see [2] and the references therein).
We can use the code $c_G$ as an erasure code, and we can use the graph $G$ for iterative decoding using a variation of The Sum–Product Algorithm. Suppose that a codeword $w_1 \cdots w_N \in c_G$ is transmitted, and let $u_1 \cdots u_N$ be the received string. We set
$$I = \{i \leq N : u_i = w_i \neq\ ?\}.$$
Let $b_i$ be a node in the set $\{b_j : j \in I\}$. Notice that the value assigned to the latter node is known after transmission. Decoding (which corresponds to computing the codeword that was transmitted) reduces to propagating the values of those nodes through the code graph. The sum-product update rule reduces simply to:

Let $b$ be a bit node. Suppose that $b$ is adjacent to a parity check node $p$, and suppose that for all $c \neq b$ such that $c$ is adjacent to $p$ we know the value of $c$. Then the value of $b$ is set equal to the mod-2 sum of all those $c$'s.

Since all unerased variables are correct, there is no chance that these rules could produce a variable assignment that conflicts with another assignment.
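The update rule above translates directly into a peeling decoder: repeatedly look for a parity check in which exactly one participating bit is still erased, and fill that bit in with the mod-2 sum of the others. A sketch follows (ours; the parity-check matrix is that of the $[7,4]$ code used earlier in these notes, an assumption made only to have a concrete example).

def erasure_decode(pc_matrix, received):
    """Iterative erasure decoding on a binary parity-check code.

    `received` is a list of 0, 1 or None (an erased position).  Whenever a
    parity check has exactly one erased participant, that bit is set to the
    mod-2 sum of the other participants.  Returns the completed word, or the
    partially decoded word if the process gets stuck."""
    word = list(received)
    progress = True
    while progress:
        progress = False
        for row in pc_matrix:
            positions = [j for j, entry in enumerate(row) if entry == 1]
            erased = [j for j in positions if word[j] is None]
            if len(erased) == 1:
                word[erased[0]] = sum(word[j] for j in positions if j != erased[0]) % 2
                progress = True
    return word

if __name__ == "__main__":
    PC = [
        [1, 1, 1, 0, 1, 0, 0],
        [1, 1, 0, 1, 0, 1, 0],
        [1, 0, 1, 1, 0, 0, 1],
    ]
    codeword = [1, 0, 1, 1, 0, 0, 1]
    received = [1, None, 1, 1, None, 0, 1]            # two erased bits
    print(erasure_decode(PC, received) == codeword)   # True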
Let us study a special type of sparse graph codes that are used as erasure codes, and which are decoded using iterative decoding.
9.3.1 Fountain codes

Fountain codes (also known as rateless erasure codes) are a class of erasure codes with the property that a potentially limitless sequence of encoding symbols can be generated from a given set of source symbols, in such a way that the original source symbols can ideally be recovered from any subset of the encoding symbols of size equal to, or only slightly larger than, the number of source symbols. The term fountain or rateless refers to the fact that these codes do not exhibit a fixed code rate.

A fountain code is optimal if the original $k$ source symbols can be recovered from any $k$ encoding symbols. Fountain codes are known that have efficient encoding and decoding algorithms, and that allow the recovery of the original $k$ source symbols from any $k'$ of the encoding symbols with high probability, where $k'$ is just slightly larger than $k$.

Fountain codes were created by Michael Luby at his company Digital Fountain, the first company whose business is based on sparse-graph codes (see the next lecture). Luby invented LT codes, a special type of fountain codes, in 1998.

An LT code works as follows:
Each encoded packet $t_n$ is produced from the source file $s_1 \cdots s_k$ as follows:

1. Find a check node $t_m$ that is connected to only one source packet $s_i$ (if there is no such check node, this decoding algorithm halts at this point, and fails to recover all the source packets).

(a) Set $s_i = t_m$.

(b) Add $s_i$ to all checks $t_r$ that are connected to $s_i$.

(c) Remove all the edges connected to the source packet $s_i$.
The above probabilistic strategy comes to an end, if and only if, the halting condition in step one is not reached before all the source packets are determined. The latter can be ensured, with high probability, if one chooses the right degree distribution: occasional encoded packets must have high degree (i.e., $d$ similar to $k$) in order to ensure that there are not some source packets that are connected to no one. On the other hand, many packets must have low degree, so that the decoding process can get started, and keep going, and so that the total number of addition operations involved in the encoding and decoding is kept small.
Prove that the expected degree under this distribution is equal to ln (k) :
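The distribution referred to here is, presumably, the ideal soliton distribution $\rho(1) = \frac{1}{k}$, $\rho(d) = \frac{1}{d(d-1)}$ for $d = 2, \ldots, k$ (its definition does not appear in this excerpt, so take that as an assumption of ours). The sketch below builds it and checks that its expected degree grows like $\ln(k)$.

import math

def ideal_soliton(k):
    """Ideal soliton distribution on degrees 1..k (assumed standard form):
    rho(1) = 1/k and rho(d) = 1 / (d * (d - 1)) for d = 2, ..., k."""
    rho = {1: 1 / k}
    for d in range(2, k + 1):
        rho[d] = 1 / (d * (d - 1))
    return rho

if __name__ == "__main__":
    k = 10000
    rho = ideal_soliton(k)
    print(abs(sum(rho.values()) - 1) < 1e-9)  # it is a probability distribution
    expected_degree = sum(d * p for d, p in rho.items())
    print(expected_degree, math.log(k))       # about 9.79 and 9.21: grows like ln(k)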
The ideal soliton distribution behaves almost as required, but it does not work in practice. The robust soliton distribution has two extra parameters, $c$ and $\delta$, and it is designed to ensure that the expected number of degree-one checks is about
$$S = c\,\ln\!\left(\frac{k}{\delta}\right)\sqrt{k}.$$
Luby's key result is that (for an appropriate value of the constant $c$) receiving $k' = k + 2\ln\!\left(\frac{S}{\delta}\right) S$ packets ensures that all packets can be recovered with probability at least $1 - \delta$.
Remark 113 Raptor codes are the first known class of fountain codes with linear time encoding and decoding. Raptor codes were invented by Amin Shokrollahi [3]. Raptor codes are formed by the concatenation of two codes. A fixed-rate erasure code, usually with a fairly high rate, is applied as an outer code. The inner code takes the result of the pre-coding operation and generates a sequence of encoding symbols. Each encoding symbol is the XOR of a pseudo-randomly chosen set of symbols from the pre-code output. The number of symbols which are XOR'ed together to form an output symbol is chosen pseudo-randomly for each output symbol according to a specific probability distribution. This distribution, as well as the mechanism for generating pseudo-random numbers for sampling this distribution and for choosing the symbols to be XOR'ed, must be known to both sender and receiver.
References
[1] I. Reed, G. Solomon. Polynomial Codes over Certain Finite Fields. Journal
of the Society for Industrial and Applied Mathematics, 8 (2): 300–304, 1960.
[2] J. Byers, M. Luby, M. Mitzenmacher, A. Rege. Digital fountain approach to reliable distribution of bulk data. ACM SIGCOMM Computer Communication Review 28(4): 1–11, 1998.
[3] A. Shokrollahi. Raptor Codes. IEEE Transactions on Information Theory, 52: 2551–2567, 2006.