Entropy
Let X be a discrete random variable with alphabet \mathcal{X} and probability mass function p(x) = Pr\{X = x\}, x \in \mathcal{X}. The entropy of the variable X is defined by

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)

The logarithm can be in any base, but normally base 2 is used. The unit of the entropy is then called bits. If we use base b in the logarithm, we denote the entropy by H_b(X). We can easily convert between entropies in different bases: H_b(X) = (\log_b a) H_a(X). By convention 0 \log 0 = 0, since y \log y \rightarrow 0 as y \rightarrow 0. The entropy is a measure of the information content of a random variable.
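As an illustration, here is a minimal Python sketch of the definition above; the function name entropy and the example distributions are just illustrative choices, not part of the lecture material.

import math

def entropy(p, base=2):
    """H(X) = -sum_x p(x) log p(x), using the convention 0 log 0 = 0."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

print(entropy([0.5, 0.5]))               # 1.0 bit for a fair coin
print(entropy([0.9, 0.1]))               # about 0.47 bits for a biased coin
print(entropy([0.9, 0.1], base=math.e))  # the same entropy measured in nats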
Entropy, cont.
The entropy is always non-negative: H(X) \geq 0. Proof: Since 0 \leq p(x) \leq 1, we have \log p(x) \leq 0, so every term -p(x) \log p(x) is non-negative.

The entropy is never larger than the logarithm of the alphabet size: H(X) \leq \log |\mathcal{X}|, with equality if and only if the distribution is uniform. Proof: see later slide.
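A quick numerical check of both bounds, as a sketch: the distributions drawn below are arbitrary, and the alphabet size 4 is an assumed example.

import math, random

def entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

random.seed(0)
for _ in range(5):
    w = [random.random() for _ in range(4)]
    p = [wi / sum(w) for wi in w]
    assert 0 <= entropy(p) <= math.log2(4) + 1e-12

print(entropy([0.25] * 4))   # the uniform distribution attains log2(4) = 2 bits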
The joint entropy of two discrete random variables X and Y with joint probability mass function p(x, y) is defined as

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)

The conditional entropy of Y given X is defined as

H(Y | X) = \sum_{x \in \mathcal{X}} p(x) H(Y | X = x)
         = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y | x) \log p(y | x)
         = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x)
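To make the definition concrete, here is a small Python sketch that computes H(Y|X) directly from a joint distribution; the joint pmf p_xy and the function name cond_entropy are hypothetical choices for illustration.

import math

# Hypothetical joint pmf p(x, y) over alphabets {'a', 'b'} and {0, 1}
p_xy = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

def cond_entropy(p_xy):
    """H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)."""
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(cond_entropy(p_xy))   # about 0.85 bits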
Chain rule
The joint entropy can be written as the sum H(X, Y) = H(X) + H(Y | X).
Proof:

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) p(y | x)
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x)
        = -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x)
        = H(X) + H(Y | X)
The chain rule can easily be generalized to larger collections of random variables. Let X_1, X_2, \ldots, X_n be drawn according to p(x_1, x_2, \ldots, x_n). Then we have that

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)
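The two-variable chain rule can be checked numerically; the sketch below uses an arbitrary, hypothetical joint pmf p_xy chosen only for illustration.

import math

# Hypothetical joint pmf p(x, y), used to check H(X,Y) = H(X) + H(Y|X)
p_xy = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

H_joint = H(p_xy.values())
H_x = H(p_x.values())
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(abs(H_joint - (H_x + H_y_given_x)) < 1e-9)   # True: the chain rule holds numerically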
Relative entropy
The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined by

D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}

The relative entropy is always non-negative, and zero if and only if p = q. Note that the relative entropy is not a true distance: it is not symmetric and does not satisfy the triangle inequality.
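A minimal sketch of the definition in Python; the function name kl_divergence and the distributions p and q are assumed examples, and the code assumes q(x) > 0 wherever p(x) > 0.

import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log (p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # positive
print(kl_divergence(q, p))   # generally a different value: D is not symmetric
print(kl_divergence(p, p))   # 0, since p = q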
Mutual information
The mutual information between random variables X and Y with joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y) is defined as

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = D(p(x, y)||p(x) p(y))

The mutual information is a measure of the amount of information that one random variable contains about another random variable.
The mutual information can be expressed in terms of entropies:

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x | y)}{p(x)}
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x | y)
        = H(X) - H(X | Y)
There is also a chain rule for mutual information:

I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, X_{i-2}, \ldots, X_1)
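As a sketch, the two expressions for mutual information above can be compared numerically; the joint pmf p_xy below is an assumed example, not taken from the slides.

import math

# Hypothetical joint pmf p(x, y); marginals are computed from it
p_xy = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Definition: I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Equivalent form: I(X;Y) = H(X) - H(X|Y)
H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)
print(round(I, 6), round(H(p_x.values()) - H_x_given_y, 6))   # the two computations agree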
Jensen's inequality
A function f is said to be convex over an interval (a, b) if for every x_1, x_2 \in (a, b) and 0 \leq \lambda \leq 1

f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)

The function is said to be strictly convex if equality holds only if \lambda = 0 or \lambda = 1. A function f is (strictly) concave if -f is (strictly) convex. Jensen's inequality: If f is a convex function and X is a random variable, then

E f(X) \geq f(E X)

If f is strictly convex, then equality implies that X = E X with probability 1, i.e. X is a constant.
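A small numerical illustration of Jensen's inequality for the convex function f(x) = x^2; the sample size and the Gaussian samples are arbitrary choices for this sketch.

import random

# For f(x) = x^2 (convex), Jensen gives E[f(X)] >= f(E[X]); the gap is the variance of X
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
Ef = sum(x * x for x in xs) / len(xs)   # empirical E[X^2]
fE = (sum(xs) / len(xs)) ** 2           # (empirical E[X])^2
print(Ef >= fE)                          # True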
Information inequality
Let p(x), q(x), x \in \mathcal{X} be two probability mass functions. Then D(p||q) \geq 0, with equality if and only if p(x) = q(x) for all x.
Proof: Let A = \{x : p(x) > 0\}. Then

D(p||q) = \sum_{x \in A} p(x) \log \frac{p(x)}{q(x)} = -\sum_{x \in A} p(x) \log \frac{q(x)}{p(x)}
        \geq -\log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} = -\log \sum_{x \in A} q(x)
        \geq -\log \sum_{x \in \mathcal{X}} q(x) = -\log 1 = 0

where the first inequality follows from Jensen's inequality applied to the concave function \log t. Since \log t is a strictly concave function of t, we have equality if and only if q(x)/p(x) is constant everywhere, i.e. p(x) = q(x).
Proof that H(X) \leq \log |\mathcal{X}|: Let u(x) = 1/|\mathcal{X}| be the uniform probability mass function over \mathcal{X}, and let p(x) be the probability mass function for X. Then

D(p||u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \sum_{x} p(x) \log p(x) + \sum_{x} p(x) \log |\mathcal{X}| = \log |\mathcal{X}| - H(X)

Since D(p||u) \geq 0, it follows that H(X) \leq \log |\mathcal{X}|, with equality if and only if p is the uniform distribution.
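The identity D(p||u) = \log|\mathcal{X}| - H(X) is easy to check numerically; in this sketch the distribution p and the alphabet size 4 are assumed examples.

import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
u = [0.25] * 4                       # uniform distribution over |X| = 4 symbols
print(kl(p, u))                      # 0.25
print(math.log2(4) - entropy(p))     # 2 - 1.75 = 0.25, the same value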
Conditional entropy
Conditioning reduces entropy: H(X | Y) \leq H(X), with equality if and only if X and Y are independent.
Proof: 0 \leq I(X; Y) = H(X) - H(X | Y).
Knowing another random variable Y reduces (on average) the uncertainty of the variable X.
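The "on average" qualifier matters: a particular observation Y = y can increase the uncertainty about X even though conditioning helps on average. The sketch below uses a hand-picked, hypothetical joint pmf to show both effects.

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative joint pmf, chosen so that observing y = 2 increases uncertainty about X
p_xy = {(1, 1): 0.0, (2, 1): 0.75, (1, 2): 0.125, (2, 2): 0.125}

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_x = H(p_x.values())                                                # about 0.54 bits
H_x_given_y2 = H([p_xy[(1, 2)] / p_y[2], p_xy[(2, 2)] / p_y[2]])     # 1 bit
H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

print(H_x_given_y2 > H_x)    # True: a single outcome y can increase the uncertainty
print(H_x_given_y <= H_x)    # True: but on average, conditioning reduces entropy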
Markov chains
Random variables X, Y, Z are said to form a Markov chain in that order (denoted X \rightarrow Y \rightarrow Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X, i.e. if the joint probability mass function can be written as

p(x, y, z) = p(x) p(y|x) p(z|y)

X \rightarrow Y \rightarrow Z if and only if X and Z are conditionally independent given Y. Markovity gives

p(x, z | y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z|y)}{p(y)} = p(x|y) p(z|y)

X \rightarrow Y \rightarrow Z implies that Z \rightarrow Y \rightarrow X. If Z = f(Y) then X \rightarrow Y \rightarrow Z.
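The factorization and the conditional-independence statement can be checked on a small example. In this sketch, the component distributions p(x), p(y|x), p(z|y) are generated at random and are purely illustrative.

import itertools, random

random.seed(0)

def rand_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

# Build a joint pmf of the Markov form p(x, y, z) = p(x) p(y|x) p(z|y)
p_x = rand_pmf(2)
p_y_given_x = [rand_pmf(2) for _ in range(2)]
p_z_given_y = [rand_pmf(2) for _ in range(2)]

p = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
     for x, y, z in itertools.product(range(2), repeat=3)}

# Check conditional independence: p(x, z | y) = p(x | y) p(z | y) for every (x, y, z)
for x, y, z in p:
    p_y = sum(v for (a, b, c), v in p.items() if b == y)
    p_xz_y = p[(x, y, z)] / p_y
    p_x_y = sum(v for (a, b, c), v in p.items() if a == x and b == y) / p_y
    p_z_y = sum(v for (a, b, c), v in p.items() if b == y and c == z) / p_y
    assert abs(p_xz_y - p_x_y * p_z_y) < 1e-12
print("X and Z are conditionally independent given Y")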
The AEP

If X_1, X_2, \ldots are i.i.d. according to p(x), then -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \rightarrow H(X) in probability.
Proof: Functions of independent random variables are also independent random variables. Since the X_i are i.i.d., so are the \log p(X_i). Thus, by the law of large numbers

-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_{i} \log p(X_i) \rightarrow -E \log p(X) = H(X) in probability
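An empirical sketch of this convergence; the source distribution p and the block lengths are arbitrary example choices.

import math, random

# For i.i.d. X_i, -(1/n) log2 p(X_1, ..., X_n) approaches H(X) as n grows
random.seed(1)
p = {'a': 0.7, 'b': 0.2, 'c': 0.1}
H = -sum(v * math.log2(v) for v in p.values())           # about 1.16 bits

symbols, weights = zip(*p.items())
for n in (10, 100, 10_000):
    xs = random.choices(symbols, weights=weights, k=n)
    per_symbol = -sum(math.log2(p[x]) for x in xs) / n    # -(1/n) log2 p(x_1, ..., x_n)
    print(n, round(per_symbol, 3), "vs H(X) =", round(H, 3))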
Typical sets
The typical set A_\epsilon^{(n)} with respect to p(x) is the set of sequences (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n with the property

2^{-n(H(X)+\epsilon)} \leq p(x_1, x_2, \ldots, x_n) \leq 2^{-n(H(X)-\epsilon)}

The set A_\epsilon^{(n)} has the following properties:

1. If (x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)} then H(X) - \epsilon \leq -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \leq H(X) + \epsilon
2. Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon for sufficiently large n
3. |A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}
4. |A_\epsilon^{(n)}| \geq (1 - \epsilon) 2^{n(H(X)-\epsilon)} for sufficiently large n

The typical set has probability near 1, all elements of the typical set are nearly equiprobable, and the number of elements in the set is nearly 2^{nH}.
Proof of property 3:

1 = \sum_{x \in \mathcal{X}^n} p(x) \geq \sum_{x \in A_\epsilon^{(n)}} p(x) \geq \sum_{x \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = 2^{-n(H(X)+\epsilon)} |A_\epsilon^{(n)}|

and thus |A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}.
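The typical set can be constructed by brute force for a small example. In this sketch the source is Bernoulli, and p1, n and eps are arbitrary illustrative choices; note that for such a small n the probability of the typical set is still well below 1 - eps (property 2 only kicks in for sufficiently large n).

import itertools, math

# Brute-force construction of the typical set for a Bernoulli(p1) source
p1, n, eps = 0.3, 12, 0.1
p = {0: 1 - p1, 1: p1}
H = -sum(v * math.log2(v) for v in p.values())

typical, prob = [], 0.0
for seq in itertools.product((0, 1), repeat=n):
    p_seq = math.prod(p[s] for s in seq)
    if 2 ** (-n * (H + eps)) <= p_seq <= 2 ** (-n * (H - eps)):
        typical.append(seq)
        prob += p_seq

print(len(typical), "typical sequences; 2^(nH) is about", round(2 ** (n * H)))
print("probability of the typical set:", round(prob, 3))   # grows towards 1 as n increases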