Academic Year 2023/2024

Information Theory for Machine Learning


Coordinating teachers: Tobias Koch, Alejandro Lancho Serrano

In blue, references to the book Elements of Information Theory by T.M. Cover and J.A. Thomas (2nd edition)

Summary February 5, 2024


1.3 Relative Entropy and Mutual Information (Sections 2.3 and 2.4)
Definition 6 (Mutual information). Let X and Y be two random variables with joint PMF PXY taking value in X × Y.
Further let PX × PY be the product of the marginal distributions
PX (x) = ∑_{y∈Y} PXY (x, y),        PY (y) = ∑_{x∈X} PXY (x, y)

That is,
PX × PY (x, y) = PX (x)PY (y)
Then, the mutual information between X and Y is defined as
I(X; Y ) ≜ D(PXY ∥PX × PY )

→ We have I(X; Y ) = I(Y ; X)


→ The mutual information between X and Y can be written in terms of entropies:
I(X; Y ) = H(Y ) − H(Y |X)
= H(X) − H(X|Y )
= H(X) + H(Y ) − H(X, Y )
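
These identities can be checked numerically on a small example. The sketch below (Python with NumPy; the joint PMF is an arbitrary toy choice, not taken from the course) computes I(X; Y) once as the relative entropy D(PXY ∥ PX × PY) and once as H(X) + H(Y) − H(X, Y).

```python
import numpy as np

# Arbitrary toy joint PMF P_XY on a 2x3 alphabet (rows index x, columns index y).
P_XY = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.05, 0.25]])

P_X = P_XY.sum(axis=1)        # marginal PMF of X
P_Y = P_XY.sum(axis=0)        # marginal PMF of Y
P_prod = np.outer(P_X, P_Y)   # product distribution P_X x P_Y

# Mutual information as relative entropy D(P_XY || P_X x P_Y), in bits.
I_kl = np.sum(P_XY * np.log2(P_XY / P_prod))

# The same quantity via entropies: H(X) + H(Y) - H(X, Y).
H = lambda p: -np.sum(p * np.log2(p))   # entropy of a strictly positive PMF
I_ent = H(P_X) + H(P_Y) - H(P_XY)

print(I_kl, I_ent)            # both are approximately 0.164 bits
```

Transposing P_XY and repeating the computation gives the same value, consistent with I(X; Y) = I(Y; X).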

Definition 7 (Conditional mutual information). Let X, Y , and Z be three random variables with joint PMF PXY Z taking
value in X × Y × Z. Then, the conditional mutual information between X and Y given Z is defined as
I(X; Y |Z) ≜ H(X|Z) − H(X|Y, Z)

→ We have
I(X; Y |Z) = H(X|Z) − H(X|Y, Z)
= H(Y |Z) − H(Y |X, Z)
" #
PXY |Z (X, Y |Z)
= E log
PX|Z (X|Z)PY |Z (Y |Z)
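
As a sanity check, the sketch below (Python with NumPy; the binary joint PMF is a randomly generated toy example) evaluates I(X; Y |Z) both as H(X|Z) − H(X|Y, Z) and as the expectation of the log-ratio above, using H(A|B) = H(A, B) − H(B).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random strictly positive toy joint PMF P_XYZ on binary X, Y, Z (indexed [x, y, z]).
P = rng.random((2, 2, 2))
P /= P.sum()

H = lambda p: -np.sum(p * np.log2(p))        # joint entropy of a positive PMF

# Conditional entropies via H(A|B) = H(A, B) - H(B).
P_XZ = P.sum(axis=1)                         # marginal over (X, Z)
P_YZ = P.sum(axis=0)                         # marginal over (Y, Z)
P_Z  = P.sum(axis=(0, 1))                    # marginal over Z
H_X_given_Z  = H(P_XZ) - H(P_Z)
H_X_given_YZ = H(P) - H(P_YZ)
I_def = H_X_given_Z - H_X_given_YZ           # I(X;Y|Z) = H(X|Z) - H(X|Y,Z)

# The same quantity as E[ log P_XY|Z / (P_X|Z * P_Y|Z) ].
ratio = (P * P_Z[None, None, :]) / (P_XZ[:, None, :] * P_YZ[None, :, :])
I_exp = np.sum(P * np.log2(ratio))

print(I_def, I_exp)                          # the two expressions coincide
```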

Theorem 2 (Theorem 2.5.2—“Chain rule for mutual information”). Let X, Y , and Z be three random variables with joint
PMF PXY Z taking value in X × Y × Z. Then
I(X, Y ; Z) = I(X; Z) + I(Y ; Z|X) = I(Y ; Z) + I(X; Z|Y )

→ The chain rule for mutual information can be generalized to n + 1 random variables:
I(X1 , . . . , Xn ; Y ) = ∑_{k=1}^{n} I(Xk ; Y |X1^{k−1}),
where X1^{k−1} ≜ (X1 , . . . , Xk−1 ) and the k = 1 term is simply I(X1 ; Y ).
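
The three-variable chain rule of Theorem 2 can be verified numerically. The sketch below (Python with NumPy; the joint PMF is a randomly generated toy example) expresses every mutual information term through joint entropies and checks that I(X, Y ; Z) = I(X; Z) + I(Y ; Z|X).

```python
import numpy as np

rng = np.random.default_rng(1)

# Random strictly positive toy joint PMF P_XYZ on a 2x3x2 alphabet (indexed [x, y, z]).
P = rng.random((2, 3, 2))
P /= P.sum()

H = lambda p: -np.sum(p * np.log2(p))        # joint entropy of a positive PMF

P_X  = P.sum(axis=(1, 2))                    # marginal of X
P_Z  = P.sum(axis=(0, 1))                    # marginal of Z
P_XY = P.sum(axis=2)                         # marginal over (X, Y)
P_XZ = P.sum(axis=1)                         # marginal over (X, Z)

# I(X,Y;Z) = H(X,Y) + H(Z) - H(X,Y,Z)
I_XY_Z = H(P_XY) + H(P_Z) - H(P)
# I(X;Z) = H(X) + H(Z) - H(X,Z)
I_X_Z = H(P_X) + H(P_Z) - H(P_XZ)
# I(Y;Z|X) = H(X,Y) + H(X,Z) - H(X,Y,Z) - H(X)
I_Y_Z_given_X = H(P_XY) + H(P_XZ) - H(P) - H(P_X)

print(I_XY_Z, I_X_Z + I_Y_Z_given_X)         # chain rule: the two sides agree
```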

1.4 Jensen’s Inequality and Its Consequences (Section 2.6)


Definition 8 (Convex functions). A function f is said to be convex if for every x1 , x2 ∈ X and 0 ≤ λ ≤ 1,
f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 )
A function is said to be strictly convex if equality holds only if λ ∈ {0, 1} or x1 = x2 .

Definition 9 (Concave functions). A function f is said to be (strictly) concave if −f is (strictly) convex.

→ If f has a second derivative that is nonnegative everywhere, then f is convex. If the second derivative is positive everywhere, then f is strictly convex. (Theorem 2.6.1)
→ If f has a second derivative that is nonpositive everywhere, then f is concave. If the second derivative is negative everywhere, then f is strictly concave.
→ The definitions of convexity and concavity above do not require a second derivative; they also apply to non-differentiable functions.
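
As an illustration of Definition 8 and the second-derivative test, the sketch below (Python with NumPy; the function and sampling range are arbitrary choices) checks the convexity inequality for f(x) = x log x, whose second derivative 1/x is positive on x > 0.

```python
import numpy as np

# f(x) = x * log(x) has second derivative 1/x > 0 on x > 0, so it is strictly convex there.
f = lambda x: x * np.log(x)

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(0.1, 10.0, size=(2, 1000))   # random pairs of points in (0.1, 10)
lam = rng.uniform(0.0, 1.0, size=1000)            # random mixing coefficients lambda

lhs = f(lam * x1 + (1 - lam) * x2)                # f(lambda*x1 + (1-lambda)*x2)
rhs = lam * f(x1) + (1 - lam) * f(x2)             # lambda*f(x1) + (1-lambda)*f(x2)

print(np.all(lhs <= rhs + 1e-12))                 # True: the convexity inequality holds
```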

Theorem 3 (Theorem 2.6.2—“Jensen’s inequality”). If f is a convex function and X is a random variable, then
E[f (X)] ≥ f (E[X])
Moreover, if f is strictly convex, then E[f (X)] = f (E[X]) implies that X = E[X] with probability one.

→ Jensen’s inequality implies that, if f is a concave function and X is a random variable, then
E[f (X)] ≤ f (E[X])
Moreover, if f is strictly concave, then E[f (X)] = f (E[X]) implies that X = E[X] with probability one.
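
A quick numerical illustration of Jensen's inequality: the sketch below (Python with NumPy; the support points and PMF are an arbitrary toy choice) compares E[f(X)] and f(E[X]) for the strictly convex function f(x) = x².

```python
import numpy as np

# Toy discrete random variable X: support points and an arbitrary PMF.
x = np.array([-1.0, 0.0, 2.0, 3.0])
p = np.array([0.2, 0.3, 0.4, 0.1])

f = lambda t: t ** 2                # strictly convex

E_fX = np.sum(p * f(x))             # E[f(X)]
f_EX = f(np.sum(p * x))             # f(E[X])

print(E_fX >= f_EX)                 # True, as guaranteed by Jensen's inequality
print(E_fX, f_EX)                   # 2.7 vs 0.81 for this example
```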

Theorem 4 (Theorem 2.6.3—“Information inequality”). Let P and Q be two PMFs taking value in X . Then
D(P ∥Q) ≥ 0
with equality if, and only if, P (x) = Q(x) for all x ∈ X.
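
The information inequality can also be checked numerically: the sketch below (Python with NumPy; the PMFs are randomly generated toy examples) evaluates D(P ∥ Q) for a random pair of strictly positive PMFs and confirms that D(P ∥ P) = 0.

```python
import numpy as np

rng = np.random.default_rng(3)

def D(P, Q):
    """Relative entropy D(P || Q) in bits for strictly positive PMFs."""
    return np.sum(P * np.log2(P / Q))

# Random strictly positive PMFs on an alphabet of size 5 (toy example).
P = rng.random(5); P /= P.sum()
Q = rng.random(5); Q /= Q.sum()

print(D(P, Q) >= 0)      # True: information inequality
print(D(P, P))           # 0.0: equality when the two PMFs coincide
```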
