No of Flips For First Head

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

ECE 534: Elements of Information Theory, Fall 2010

Homework 1

Solutions

Ex. 2.1 (Davide Basilio Bartolini)

Text

Coin Flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips
required.

(a) Find the Entropy H(X) in bits


(b) A random variable X is drawn according to this distribution. Find an “efficient” sequence of
yes-no questions of the form, “Is X contained in the set S?”. Compare H(X) to the expected
number of questions required to determine X.

Solution

(a) The random variable X is on the domain X = {1, 2, 3, . . .} and it denotes the number of flips
needed to get the first head, i.e. 1 + the number of consecutive tails appeared before the first
head.
Since the coin is said to be fair, we have p(“head”) = p(“tail”) = 12 and hence (exploiting the
independence of the coin flips):

1
p(X = 1) = p(“head”) =
2
 2
1 1 1
p(X = 2) = p(“tail”) ∗ p(“head”) = ∗ =
2 2 2
..
.
ntimes  n
z }| { 1 1 1 1
p(X = n) =p(“tail”) ∗ . . . ∗ p(“tail”) ∗p(“head”) = ∗ . . . ∗ ∗ =
2 2 2 2

from this, it is clear that the probability mass distribution of X is:

 x
1
pX (x) =
2

1
Once the distribution is known, H(X) can be computed from the definition:

X
H(X) = − pX (x) log2 pX (x)
x∈X
∞  x  x
X 1 1
=− log2
2 2
x=1
∞  x  x
X 1 1
=− log2 (since the summed expr. equals 0 for x = 0)
2 2
x=0
∞  x  
X 1 1
=− x log2 (property of logarithms)
2 2
x=0
 X ∞  x
1 1
= − log2 x
2 2
x=0
∞  x ∞
!
1
X 1 X k
= x= 2
2 = 2 [bit] exploiting (k)x x =
x=0
2 1 − 21 x=0
(1 − k)2

(b) Since the most likely value for X is 1 (p(X = 1) = 12 ), the most efficient first question is: “Is
X = 1?”; the next question will be “Is X = 2?” and so on, until a positive answer is found.
If this strategy is used, the random variable Y representing the number of questions will have
the same distribution as X and it will be:

∞  y 1
X 1 2
E [Y ] = y =  =2
2 1 2
y=0 1 − 2

which is exactly equal to the entropy of X.


An interpretation of this fact could be that 2 bits (which is the entropy value for X) are the
amount of memory required to store the outcomes of the two binary questions which are enough
(on average) to get a positive answer on the value of X.

Exercise 2.4 (Matteo Carminati)

Entropy of functions of a random variable. Let X be a discrete random variable. Show that the
entropy of a function of X is less than or equal to the entropy of X by justifying the following
steps:

(a) (b)
H(X, g(X)) = H(X) + H(g(X)|X) = H(X) (1)
(c) (d)
H(X, g(X)) = H(g(X)) + H(X|g(X)) ≥ H(g(X)) (2)

2
Thus, H(g(X)) ≤ H(X).

Solution

(a) It comes from entropy’s chain rule applied to random variables X and g(X), i.e. H(X, Y ) =
H(X) + H(Y |X), so H(X, g(X)) = H(X) + H(g(X)|X).

(b) Intuitively, if g(x) depends only on X and if the value of X is known, g(X) is completely
specified and it has a deterministic value. The entropy of a deterministic value is 0, so
H(g(X)|X) = 0 and H(X) + H(g(X)|X) = H(X).

(c) Again, this formula comes from the entropy’s chain rule, in the form: H(X, Y ) = H(Y ) +
H(X|Y ).

(d) Proving that H(g(X)) + H(X|g(X)) ≥ H(g(X)) means proving that H(X|g(X)) ≥ 0: the
non-negativity is one of the property of entropy and can be proved from its definition by noting
that the logarithm of a probability (a quantity always less than or equal to 1) is non-positive.
In particular H(X|g(X)) = 0 if the knowledge of the value of g(X) allows to totally specify
the value of X; otherwise H(X|g(X)) > 0 (for example if g(X) is an injective function).

Ex. 2.7(a) (Davide Basilio Bartolini)

Text

Coin weighing. Suppose that one has n coins, among which there may or may not be one counterfeit
coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The
coins are to be weighed by a balance.
Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin
(if any) and correctly declare it to be heavier or lighter.

Solution

Let X be a string of n characters on the alphabet X = {−1, 0, 1}n , each of which represents one
coin. Each of the characters of X may have three different values (say 1 if the coin is heavier than
a normal one, 0 if it is regular, −1 if it is lighter). Since only one of the coins may be counterfeit,
X may be a string of all 0 (if all the coins are regular) or may present either a 1 or a −1 at only
one position. Thus, the possible configurations for X are 2n + 1.
Under the hypothesis of a uniform distribution of the probability of which coin is counterfeit, the
entropy of X will be:

H(X) = log2 (2n + 1)

3
Now let Z = [Z1 , Z2 , . . . , Zk ] be a random variable representing the weighings; each of the Zi will
have three possible values to indicate whether the result of the weighing is balanced, left arm
heavier or right arm heavier. The entropy of each Zi will be upper-bounded by the three possible
values it can assume: H(Zi ) ≤ log2 3, ∀i ∈ [1, k] and for Z (under the hypothesis of independence
of the weighings):

k
(ChainR.) X
H(Z) = H(Z1 , Z2 , . . . , Zk ) = H(Zi |Zi−1 , . . . , Z1 )
i =1
k
(Indep.) X
= H(Zi ) ≤ log2 3
i =1

Since we want to know how many weghings will yield the same amount of information which is
given by the configuration of X (i.e. we want to know how many weighings will be needed to find
out which coin - if any - is counterfeit), we can write:

H(X) = H(Z) ≤ log2 3


log2 (2n + 1) ≤ log2 3
2n + 1 ≤ 3k
3k − 1
n≤ , which is the wanted upper bound.
2

Ex. 2.12 (Kenneth Palacio)

X Y 0 1
0 1/3 1/3
1 0 1/3

Table 1: p(x,y) for problem 2.12.

Find:
(a) H(X), H(Y ).
(b) H(X|Y ), H(Y |X).
(c) H(X, Y ).
(d) H(Y ) − H(Y |X).
(e) I(X; Y ).
(f) Draw a Venn diagram for the quantities in parts (a) through (e).

Solution:

4
Compute of marginal distributions:
2 1
p(x) = [ , ]
3 3
1 2
p(y) = [ , ]
3 3

(a) H(X), H(Y ).

   
2 2 1 1
H(X) = − log2 − log2 (3)
3 3 3 3
H(X) = 0.918bits (4)

   
1 1 2 2
H(Y ) = − log2 − log2 (5)
3 3 3 3
H(Y ) = 0.918bits (6)

Figure 1: H(X), H(Y)

(b) H(X|Y ), H(Y |X).

1
X
H(X|Y ) = p(y = i)H(X|Y = y) (7)
i=0
1 2
H(X|Y ) = H(X|Y = 0) + H(X|Y = 1) (8)
3 3
1 2
H(X|Y ) = H(1, 0) + H(1/2, 1/2) (9)
3 3
H(X|Y ) = 2/3 (10)

5
1
X
H(Y |X) = p(x = i)H(Y |X = x) (11)
i=0
2 1
H(Y |X) = H(Y |X = 0) + H(Y |X = 1) (12)
3 3
2 1
H(Y |X) = H(1/2, 1/2) + H(0, 1) (13)
3 3
H(Y |X) = 2/3 (14)

Figure 2: H(X|Y ), H(Y |X)

(c) H(X, Y ).

1,1
X
H(X, Y ) = p(x, y) log2 p(x, y) (15)
x=0,y=0
 
1 1
H(X, Y ) = −3 log2 (16)
3 3
H(X, Y ) = 1.5849625bits (17)

Figure 3: H(X,Y)

6
(d) H(Y ) − H(Y |X).

H(Y ) − H(Y |X) = 0.918 − 2/3 (18)


H(Y ) − H(Y |X) = 0.25134 (19)

Figure 4: H(Y ) − H(Y |X)

(e) I(X; Y ).

 
X p(x, y)
I(X; Y ) = p(x, y) log2 (20)
x,y
p(x)p(y)
     
1 1/3 1 1/3 1 1/3
I(X; Y ) = log2 + log2 + log2 (21)
3 (2/3)(1/3) 3 (2/3)(2/3) 3 (1/3)(2/3)
I(X; Y ) = 0.25162916 (22)

Figure 5: I(X;Y)

(f) Venn diagram is already shown for each item.

Ex. 2.20 (Kenneth Palacio)

Run-length coding. Let X1 ,X2 , . . . , Xn be (possibly dependent) binary random variables.


Suppose that one calculates the run lengths R = (R1 , R2 , . . .) of this sequence (in order as they
occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2). Compare

7
H(X1 , X2 , ..., Xn ), H(R), and H(Xn, R). Show all equalities and inequalities, and bound all the
differences.

Solution:

Lets assume that one random variable Xj (0 < j ≤ n) is known, then if R is also known, H(Xj,R)
will provide the same information about uncertainty than H(X1,X2, . . Xj, . . ,Xn), since the whole
sequence of X can be completely recovered from the knowledge of Xj and R. For example, with
X5 = 1 and the run lengths R = (3, 2, 2, 1, 2) it’s possible to recover the original sequence as follows:

X5 = 1, R = (3, 2, 2, 1, 2) leads to recover the sequence: X = 0001100100.

It can be concluded that:


H(Xj, R) = H(X1, X2, ....Xn)

Using the chain rule, H(Xj,R) can be written as:

H(Xj, R) = H(R) + H(Xj|R)

H(Xj|R) ≤ H(Xj), since conditioning reduces entropy.

Then it’s possible to write:

H(Xj) ≥ H(X1, X2, ...Xn) − H(R)

H(Xj) + H(R) ≥ H(X1, X2, ...Xn)

Computing H(Xj) = nx p(Xj) log2 Xj, where the distribution of Xj is unknown, it can be as-
P
sumed to be: a probability of p for Xj=0 and of (1-p) for Xj=1. It can be observed that the
maximum entropy is given when p=1/2 leading max H(Xj)= 1.

Then:

1 + H(R) ≥ H(X1, X2, ...Xn)

Considering the results obtained in problem 2.4, we can write also that: H(R) ≤ H(X). Because
R is a function of X.

You might also like