A Brief Introduction To Polar Codes
1 Introduction
Polar codes were introduced by Erdal Arıkan in 2009 and they provide the first deterministic construc-
tion of capacity-achieving codes for binary memoryless symmetric (BMS) channels [1]. They are the
culmination of Arıkan’s research into the computational cutoff rate of sequential decoding.
Figure 1: Factor Graph.
In particular, consider a setup where (U_1, U_2) are two equiprobable bits that are encoded into (X_1, X_2) = (U_1 ⊕ U_2, U_2). Then, (X_1, X_2) are mapped to (Y_1, Y_2) by two independent BMS channels with transition probabilities P(Y_1 = y | X_1 = x) = P(Y_2 = y | X_2 = x) = W(y|x). The Tanner graph for this setup is shown in Figure 1. Since the mapping from (U_1, U_2) to (X_1, X_2) is invertible, one finds that
I(U1 , U2 ; Y1 , Y2 ) = I(X1 , X2 ; Y1 , Y2 ) = I(X1 ; Y1 ) + I(X2 ; Y2 ) = 2I(W ),
where C = I(X_1; Y_1) = I(W) is the capacity of the symmetric BMS channel because X_1 is equiprobable and
I(W) ≜ Σ_y W(y|0) log2 [ W(y|0) / ( (1/2) W(y|0) + (1/2) W(y|1) ) ] = I(X_1; Y_1).
Thus, the transformation (X1 , X2 ) = (U1 ⊕ U2 , U2 ) preserves the sum capacity of the system. Readers
unfamiliar with information theory should consult the primer in Appendix A.
Also, the chain rule decomposition
I(U1 , U2 ; Y1 , Y2 ) = I(U1 ; Y1 , Y2 ) + I(U2 ; Y1 , Y2 |U1 ) = 2I(W )
implies that one can also achieve the rate 2I(W ) using two steps. First, information is transmitted
through the first virtual channel W − : U1 → (Y1 , Y2 ). By coding over multiple channel uses, one can
achieve any rate up to I(U1 ; Y1 , Y2 ). If all the U1 ’s are known from the first stage, then one can decode
the information transmitted through the second virtual channel W + : U2 → (Y1 , Y2 , U1 ). Again, coding
allows one to achieve any rate up to I(U2 ; Y1 , Y2 |U1 ).
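For example, if W is a BEC with erasure probability ε (the case analyzed in detail in Section 3.2), then W⁻ : U_1 → (Y_1, Y_2) is a BEC with erasure probability 1 − (1 − ε)² and W⁺ : U_2 → (Y_1, Y_2, U_1) is a BEC with erasure probability ε², so that I(W⁻) + I(W⁺) = (1 − ε)² + (1 − ε²) = 2(1 − ε) = 2I(W).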
If one chooses convolutional codes with sequential decoding, however, the expected decoding-complexity
per bit becomes unbounded for rates above the computational cutoff rate
R0 (W ) = 1 − log2 (1 + Z(W )),
where Z(W) ≜ Σ_y √( W(y|0) W(y|1) ) is the Bhattacharyya parameter of the channel [2]. Arıkan's key
observation was that, while the sum capacity of the two virtual channels is preserved, the sum cutoff
rate satisfies
R0 (W + ) + R0 (W − ) ≥ 2R0 (W ),
with equality iff R0 (W ) ∈ {0, 1}. Thus, repeated transformations of this type cause the implied virtual
channels to polarize into extremal channels whose capacities approach 0 or 1. But, for these extremal
channels, coding is trivial and one either sends an information bit or a dummy bit.
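For a concrete example, take W = BEC(1/2): then Z(W) = 1/2 and R_0(W) = 1 − log2(3/2) ≈ 0.415, while the two virtual channels are BECs with Z(W⁻) = 3/4 and Z(W⁺) = 1/4, so that R_0(W⁻) + R_0(W⁺) = 2 − log2(7/4) − log2(5/4) ≈ 0.871 > 2R_0(W) ≈ 0.830.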
From a theoretical point of view, polar codes are beautifully simple. Practically, they approach
capacity rather slowly as their blocklength increases [3]. Still, based on recent advances [4], they have
been chosen as the short-block codes for the 5G standard.
2 Mathematical Background
2.1 Kronecker Products
For a positive integer n, let N = 2^n and let (d_0 d_1 ... d_{n−1})_2 ≜ 1 + Σ_{i=0}^{n−1} d_i 2^i denote the integer (indexed from 1) with binary digit string d_0 d_1 ... d_{n−1}. Much of the following material can be found in [5]. For
matrices A = {ai,j } and B = {bi,j } , let A ⊗ B represent the Kronecker product
        [ a_{1,1}B   a_{1,2}B   ··· ]
A ⊗ B ≜ [ a_{2,1}B   a_{2,2}B   ··· ].
        [    ⋮           ⋮        ⋱ ]
Kronecker powers are defined by A^{⊗n} ≜ A ⊗ A^{⊗(n−1)} = A^{⊗(n−1)} ⊗ A. The following well-known identity will also be useful:
(A ⊗ B)(C ⊗ D) = AC ⊗ BD.
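As a quick numerical sanity check of this identity, one can run the following MATLAB/Octave snippet (the matrix sizes are chosen arbitrarily):

% Verify (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) for randomly chosen matrices
A = rand(2,3); C = rand(3,2);
B = rand(4,5); D = rand(5,4);
assert(norm(kron(A,B)*kron(C,D) - kron(A*C,B*D)) < 1e-12);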
An N × N permutation matrix P_N satisfies P_N^T P_N = I_N, where I_N is the N × N identity matrix, because it preserves the Euclidean distance between pairs of vectors and is therefore orthogonal. An important permutation matrix is the "mod-p perfect shuffle" S_{p,q} with N = pq. Like all permutation matrices, it is defined by where it maps the unit row vectors e_m for m = 1, 2, ..., N. In particular, S_{p,q} is defined by e_{m'}^T = S_{p,q} e_m^T iff m' = ⌊(m − 1)/p⌋ + 1 + ((m − 1) mod p) q. For example, S_{2,2} maps e_1^T, e_2^T, e_3^T, e_4^T to e_1^T, e_3^T, e_2^T, e_4^T, respectively.
Let A and B be m_1 × n_1 and m_2 × n_2 matrices, respectively. Then, the mod-p perfect shuffle allows one to write the identity
B ⊗ A = S_{m_1,m_2}^T (A ⊗ B) S_{n_1,n_2}.
The basic idea is that A ⊗ B differs from B ⊗ A only in the order of the rows and columns and mod-p
shuffle matrices can be used to rearrange the rows and columns of A ⊗ B into those of B ⊗ A. In
particular, if A is a 2 × 2 matrix and B is an (N/2) × (N/2) matrix, then
B ⊗ A = S_{2,N/2}^T (A ⊗ B) S_{2,N/2} = R_N (A ⊗ B) R_N^T,    (1)
where R_N ≜ S_{2,N/2}^T denotes the N × N reverse shuffle permutation matrix defined by its action on row vectors, u R_N = (u_1, u_3, ..., u_{N−1}, u_2, u_4, ..., u_N). Let B_N denote the N × N bit-reversal permutation matrix; it satisfies B_N = R_N (I_2 ⊗ B_{N/2}).
This holds because applying RN reorders the vector so that the LSB of the index becomes the MSB of
the index. Then, multiplying by I2 ⊗ BN/2 separately reorders the first and last halves of the vector
according to bit reversals of length N/2. Since BN reverses the bit indexing, it is symmetric and satisfies
B_N^T = B_N. Thus, we also find that
B_N = B_N^T = [ B_{N/2}^T  0 ; 0  B_{N/2}^T ] R_N^T = [ B_{N/2}  0 ; 0  B_{N/2} ] R_N^T = (I_2 ⊗ B_{N/2}) R_N^T.
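The following MATLAB/Octave sketch builds S_{p,q}, R_N, and B_N directly from the definitions above (the helper names are illustrative, not from the text) and verifies identity (1) together with the B_N relations:

function check_shuffles()
% Check identity (1) and the bit-reversal relations for a small N
N = 8;
R = reverse_shuffle(N);                          % R_N = S_{2,N/2}^T
B = bitrev(N);                                   % bit-reversal permutation B_N
A = randi([0 1],2,2); M = randi([0 1],N/2,N/2);
assert(isequal(kron(M,A), R*kron(A,M)*R'));      % identity (1): M ⊗ A = R_N (A ⊗ M) R_N^T
assert(isequal(B,B'));                           % B_N is symmetric
assert(isequal(B, kron(eye(2),bitrev(N/2))*R')); % B_N = (I_2 ⊗ B_{N/2}) R_N^T
end

function S = shuffle(p,q)
% mod-p perfect shuffle S_{p,q}: S e_m^T = e_{m'}^T with
% m' = floor((m-1)/p) + 1 + mod(m-1,p)*q
N = p*q; S = zeros(N);
for m = 1:N
    S(floor((m-1)/p) + 1 + mod(m-1,p)*q, m) = 1;
end
end

function R = reverse_shuffle(N)
R = shuffle(2,N/2)';                             % R_N = S_{2,N/2}^T
end

function B = bitrev(N)
% Bit reversal defined recursively by B_1 = 1 and B_N = R_N (I_2 ⊗ B_{N/2})
if N == 1
    B = 1;
else
    B = reverse_shuffle(N) * kron(eye(2), bitrev(N/2));
end
end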
2.2 Soft Decoding
For simplicity, we will also assume that channel outputs are normalized into a posteriori probability
(APP) estimates that are either log-likelihood ratios (LLRs) or “probability of 1” (P1) values. For
example, an LLR-normalized channel output must satisfy
y_1 = ln [ P(Y_1 = y_1 | X_1 = 0) / P(Y_1 = y_1 | X_1 = 1) ],
while a P1-normalized channel output must satisfy
y_1 = P(X_1 = 1 | Y_1 = y_1) = P(Y_1 = y_1 | X_1 = 1) / [ P(Y_1 = y_1 | X_1 = 0) + P(Y_1 = y_1 | X_1 = 1) ].
Using this, the bit-node and check-node operations used to combine estimates in LDPC decoding are defined, respectively, by
y_1 ⊛ y_2 = { y_1 + y_2                                   if LLR domain
            { y_1 y_2 / ( y_1 y_2 + (1 − y_1)(1 − y_2) )  if P1 domain
and
y_1 ⊞ y_2 = { 2 tanh^{−1}( tanh(y_1/2) tanh(y_2/2) )      if LLR domain
            { y_1 (1 − y_2) + y_2 (1 − y_1)               if P1 domain.
For convenience, the ⊞ operation is also defined between APP values and hard bit values so that channel symmetry can be represented by W(y|1) = W(1 ⊞ y|0). For a, b ∈ {0, 1}, we also assume that 0 ⊞ y = y, a ⊞ y = y ⊞ a, and a ⊞ (b ⊞ y) = (a ⊕ b) ⊞ y.
As an example of these operations, consider the (X_1, X_2) = (U_1 ⊕ U_2, U_2) mapping defined in the introduction. Let (y_1, y_2) be a pair of normalized observations of (X_1, X_2). Then, ũ_1 = y_1 ⊞ y_2 is the optimal normalized bit estimate of U_1 given (Y_1, Y_2). Using this, ũ_2 = (u_1 ⊞ y_1) ⊛ y_2 is the optimal normalized bit estimate of U_2 given (Y_1, Y_2, U_1).
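These two operations appear in the listings below through small helper functions. The following is a minimal P1-domain sketch of them; the name cnop matches the later listings, while vnop is an assumed name for the bit-node helper:

function z = cnop(y1,y2)
% Check-node operation ⊞ in the P1 domain (element-wise);
% between hard bits in {0,1} it reduces to modulo-2 addition
z = y1.*(1-y2) + y2.*(1-y1);
end

function z = vnop(y1,y2)
% Bit-node operation ⊛ in the P1 domain (element-wise)
z = y1.*y2 ./ (y1.*y2 + (1-y1).*(1-y2));
end

With these helpers, the estimates above become ũ_1 = cnop(y_1, y_2) and ũ_2 = vnop(cnop(u_1, y_1), y_2).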
Figure 2: Factor graph associated with the G_N = (I_{N/2} ⊗ G_2) R_N (I_2 ⊗ G_{N/2}) decomposition, which realizes x = u (I_{N/2} ⊗ G_2) R_N (I_2 ⊗ G_{N/2}).
Algorithm 1 Recursive Implementation of the Polar Transform
function x = polar_transform(u)
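% The body below is a minimal sketch (the original listing is not shown here)
% based on the Figure 2 decomposition G_N = (I_{N/2} ⊗ G_2) R_N (I_2 ⊗ G_{N/2}):
% apply the G_2 butterflies, let R_N route the odd/even outputs to the two
% halves, and recurse on each half.
if length(u) == 1
    x = u;
else
    u1u2 = mod(u(1:2:end) + u(2:2:end), 2);   % (u_{2i-1} + u_{2i}) mod 2
    u2   = u(2:2:end);                        % u_{2i}
    x = [polar_transform(u1u2) polar_transform(u2)];
end
return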
Figure 3: Factor graph associated with the G_N = (I_2 ⊗ G_{N/2}) R_N^T (I_{N/2} ⊗ G_2) decomposition, which realizes x = u (I_2 ⊗ G_{N/2}) R_N^T (I_{N/2} ⊗ G_2).
Recall that the polar transform is defined by G_N ≜ B_N G_2^{⊗n}, where B_N is the bit-reversal permutation defined above. Using B_N = R_N (I_2 ⊗ B_{N/2}), the identity (1), and induction on n, one finds that bit reversal commutes with the Kronecker power G_2^{⊗n}:
G_N B_N = B_N G_2^{⊗n} B_N
        = R_N (I_2 ⊗ B_{N/2}) G_2^{⊗n} (I_2 ⊗ B_{N/2}) R_N^T
        = (B_{N/2} ⊗ I_2) R_N G_2^{⊗n} R_N^T (B_{N/2} ⊗ I_2)
        = (B_{N/2} ⊗ I_2)(G_2^{⊗(n−1)} ⊗ G_2)(B_{N/2} ⊗ I_2)
        = ((B_{N/2} G_2^{⊗(n−1)}) ⊗ G_2)(B_{N/2} ⊗ I_2)
        = (B_{N/2} G_2^{⊗(n−1)} B_{N/2}) ⊗ G_2
        = G_2^{⊗(n−1)} ⊗ G_2 = G_2^{⊗n}.
Thus,
G_N G_N = G_N B_N G_2^{⊗n} = G_2^{⊗n} G_2^{⊗n} = (G_2 G_2)^{⊗n} = I_2^{⊗n} = I_N.    (2)
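Since (2) says the transform is its own inverse, a quick sanity check (assuming the polar_transform function above) is:

% Applying the transform twice recovers the input, since G_N G_N = I_N
u = randi([0 1],1,8);
assert(isequal(polar_transform(polar_transform(u)), u));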
Using G_N = G_2^{⊗n} B_N and B_N = (I_2 ⊗ B_{N/2}) R_N^T, one can also get the alternative recursive decomposition
G_N = G_2^{⊗n} (I_2 ⊗ B_{N/2}) R_N^T
    = (G_2 ⊗ G_2^{⊗(n−1)})(I_2 ⊗ B_{N/2}) R_N^T
    = (G_2 ⊗ (G_2^{⊗(n−1)} B_{N/2})) R_N^T
    = (G_2 ⊗ G_{N/2}) R_N^T
    = (I_2 ⊗ G_{N/2})(G_2 ⊗ I_{N/2}) R_N^T
    = (I_2 ⊗ G_{N/2}) R_N^T (I_{N/2} ⊗ G_2).
The factor graph associated with this decomposition is shown in Figure 3. From this, we see that this
decomposition is quite natural for a recursive successive-cancellation decoding algorithm because one can
first perform a "soft-inversion" of R_N^T (I_{N/2} ⊗ G_2) using LDPC code message-passing operations and then recursively call the polar decoding algorithm for two length-N/2 polar transforms defined by I_2 ⊗ G_{N/2}.
We note that the arguments to the polar decoding algorithm are APP estimates of the bits in the output
vector and a priori probability estimates for the bits in the input vector (e.g., the a priori estimates
identify frozen bits and their values). The output from the polar decoder is a vector of hard-decision bit
estimates for the input and output bits of the polar code. The above decomposition leads naturally to
the recursive decoding algorithm defined below in Algorithm 2.
% Tunnel u decisions back up. Compute and interleave x1,x2 hard decisions
u = [uhat1 uhat2];
x = reshape([cnop(u1hardprev,u2hardprev); u2hardprev],1,[]);
end
return
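Only the tail of Algorithm 2 appears above. The following is a minimal sketch of the full recursion it belongs to, written in the P1 domain under the interface just described; the function name polar_decode_sketch and the helpers cnop/vnop from Section 2.2 are assumptions, and f holds 1/2 for information bits and the known value for frozen bits.

function [u,x] = polar_decode_sketch(y,f)
% y: P1-domain APP estimates of the output bits (channel observations)
% f: a priori P1 values of the input bits (1/2 = information bit, 0/1 = known frozen value)
% u: hard decisions for the input bits,  x: hard decisions for the output bits
N = length(y);
if N == 1
    % Noisy hard decision from the channel APP (ties broken at random), even for
    % frozen bits, so that effective-channel error rates can be measured
    u = double(y > 1/2);
    if y == 1/2, u = double(rand < 1/2); end
    if f == 1/2
        x = u;      % information bit: re-encode the decision
    else
        x = f;      % frozen bit: cancel using the known value
    end
else
    % Soft-invert R_N^T (I_{N/2} ⊗ G_2): check-node combine each output pair and
    % decode the top length-N/2 transform first
    [u1,x1hard] = polar_decode_sketch(cnop(y(1:2:end),y(2:2:end)), f(1:N/2));
    % Cancel the decoded contribution and decode the bottom length-N/2 transform
    [u2,x2hard] = polar_decode_sketch(vnop(cnop(x1hard,y(1:2:end)),y(2:2:end)), f(N/2+1:end));
    % Tunnel decisions back up; compute and interleave the output hard decisions
    u = [u1 u2];
    x = reshape([cnop(x1hard,x2hard); x2hard],1,[]);
end
end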
3.2 Decoding Analysis and Code Design
Consider the recursive successive cancellation decoder for a polar transform of length N and let y =
(y1 , . . . , yN ) be the observations of the output bits x = (x1 , . . . , xN ) through N copies of the BMS
channel W . To describe the decoder analysis, we focus on the effective channels seen by each of the
input bits in u = (u_1, ..., u_N). Let W_N^{(i)} represent the virtual channel seen by u_i during recursive successive-cancellation decoding of a length-N polar transform. The SCD process uses the entire y_1^N vector and all past decisions û_1^{i−1} to generate the soft estimate ũ_i and hard decision û_i for bit i. Assuming all past decisions are correct (i.e., û_1^{i−1} = u_1^{i−1}), the effective channel maps u_i ∈ {0, 1} to (y_1^N, u_1^{i−1}) ∈ Y^N × {0, 1}^{i−1} and is defined to be
W_N^{(i)}((y_1^N, u_1^{i−1}) | u_i) ≜ (1 / 2^{N−1}) Σ_{u_{i+1}^N ∈ {0,1}^{N−i}} W_N(y_1^N | u_1^N G_N),
where W_N(y_1^N | x_1^N) = Π_{i=1}^N W(y_i | x_i). This formula represents an experiment where u_1^N is an i.i.d. equiprobable vector encoded into x_1^N = u_1^N G_N and then transmitted through N independent W channels. The decoder receives the channel observations y_1^N and the side information u_1^{i−1}. In the formula, one can interpret the sum over u_{i+1}^N as marginalizing out the "future" input bits that are unknown to the decoder.
Since the output space of the channel W_N^{(i)} depends on N, it is also useful to define the normalized channel W̃_N^{(i)}(z | u_i), for z ∈ Z, as the output remapping of W_N^{(i)} defined by
(y_1^N, u_1^{i−1}) → z = { ln [ P(Y_1^N = y_1^N, U_1^{i−1} = u_1^{i−1} | U_i = 0) / P(Y_1^N = y_1^N, U_1^{i−1} = u_1^{i−1} | U_i = 1) ]   for the LLR domain
                         { P(U_i = 1 | Y_1^N = y_1^N, U_1^{i−1} = u_1^{i−1})   for the P1 domain,
which is a sufficient statistic for U_i given (Y_1^N, U_1^{i−1}). Based on this, the transition probability satisfies
W̃_N^{(i)}(z | u_i) = Σ_{(y_1^N, u_1^{i−1}) ∈ S(z)} W_N^{(i)}((y_1^N, u_1^{i−1}) | u_i),
where
S(z) = { (y_1^N, u_1^{i−1}) ∈ Y^N × {0,1}^{i−1} : z = ln [ P(Y_1^N = y_1^N, U_1^{i−1} = u_1^{i−1} | U_i = 0) / P(Y_1^N = y_1^N, U_1^{i−1} = u_1^{i−1} | U_i = 1) ] }   for the LLR domain,
S(z) = { (y_1^N, u_1^{i−1}) ∈ Y^N × {0,1}^{i−1} : z = P(U_i = 1 | Y_1^N = y_1^N, U_1^{i−1} = u_1^{i−1}) }   for the P1 domain.
In this case, the output space Z stays fixed for all N and is either Z = R (for the LLR domain) or Z = [0, 1] (for the P1 domain). Later, we will see how to recursively characterize W̃_N^{(2i−1)} and W̃_N^{(2i)} in terms of W̃_{N/2}^{(i)}.
Figure 4: Factor graph associated with the G_N = R_N (G_2 ⊗ I_{N/2})(I_2 ⊗ G_{N/2}) decomposition, which realizes x = u R_N (G_2 ⊗ I_{N/2})(I_2 ⊗ G_{N/2}); the odd-indexed inputs u_1, u_3, ..., u_{N−1} enter the top G_{N/2} block and the even-indexed inputs u_2, u_4, ..., u_N enter the bottom G_{N/2} block.
Before making a hard decision, the successive-cancellation decoder combines the APP estimates ũ′_1 and ũ″_1 with the check-node operation to get ũ_1 = ũ′_1 ⊞ ũ″_1. Thus, ũ_1 is an erasure unless both ũ′_1 and ũ″_1 are received correctly through their respective virtual channels. Therefore, we see that W̃_N^{(1)} is an erasure channel if W̃_{N/2}^{(1)} is an erasure channel, and we define ε_N^{(i)} to be the erasure rate of the channel W̃_N^{(i)}. Moreover, we find that ε_N^{(1)} = 1 − (1 − ε_{N/2}^{(1)})². If u_1 is known by the decoder (e.g., if ũ_1 is not an erasure), then the APP estimate of u_2 is given by ũ_2 = (u_1 ⊞ ũ′_1) ⊛ ũ″_1. Thus, W̃_N^{(2)} will be an erasure channel with erasure probability ε_N^{(2)} = (ε_{N/2}^{(1)})².
Now, assume that the bits u_1^{2i−2} are all known at the decoder. Figure 4 implies that the bits u′_1, u′_2, ..., u′_{i−1} and u″_1, u″_2, ..., u″_{i−1} will also be known at both G_{N/2} decoders. Moreover, the bits u_{2i−1}, u_{2i} are mapped through the G_2 polar transform to u′_i = u_{2i−1} ⊕ u_{2i} and u″_i = u_{2i}. Since the bits u′_1, u′_2, ..., u′_{i−1} and u″_1, u″_2, ..., u″_{i−1} are known at their respective G_{N/2} decoders, the bits u′_i and u″_i are effectively transmitted through two independent copies of the normalized virtual channel W̃_{N/2}^{(i)} (see Figure 4).
Similar to the previous arguments, ũ_{2i−1} is an erasure unless both ũ′_i and ũ″_i are received correctly through their respective virtual channels. Therefore, we see that W̃_N^{(2i−1)} is an erasure channel if W̃_{N/2}^{(i)} is an erasure channel and ε_N^{(2i−1)} = 1 − (1 − ε_{N/2}^{(i)})². If u_{2i−1} is known by the decoder (e.g., if ũ_{2i−1} is not an erasure), then the APP estimate of u_{2i} is given by ũ_{2i} = (u_{2i−1} ⊞ ũ′_i) ⊛ ũ″_i. Thus, W̃_N^{(2i)} is an erasure channel with erasure probability ε_N^{(2i)} = (ε_{N/2}^{(i)})². Thus, we find that ε_1^{(1)} = ε and
ε_N^{(2i−1)} = 1 − (1 − ε_{N/2}^{(i)})²
ε_N^{(2i)} = (ε_{N/2}^{(i)})².
Algorithm 3 Density Evolution Analysis of Polar Code over BEC
function E = polar_bec(n,e)
% Compute effective-channel erasure rates for polar code of length N=2^n on BEC(e)
E = e;
for i=1:n
% Interleave updates to keep in polar decoding order
E = reshape([1-(1-E).*(1-E); E.*E],1,[]);
end
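For example, polar_bec(2,1/2) returns E = [0.9375 0.5625 0.4375 0.0625]: the average erasure rate is still 1/2, but after only two steps the four effective channels are already spreading toward the extremal values 0 and 1.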
For a BMS channel W : {0, 1} → Y, let W⁻ be the BMS channel with output alphabet Y × Y and transition probabilities
W⁻((y_1, y_2) | u_1) ≜ Σ_{u_2 ∈ {0,1}} (1/2) W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2).
This is the virtual channel associated with decoding U_1 from (Y_1, Y_2) when (Y_1, Y_2) are observations of (X_1, X_2) = (U_1 ⊕ U_2, U_2) through two independent copies of W. Likewise, let W′ be the BMS channel with output alphabet Y × Y × {0, 1} and transition probabilities
W′((y_1, y_2, u_1) | u_2) ≜ (1/2) W(y_1 | u_1 ⊕ u_2) W(y_2 | u_2).
This is the virtual channel associated with decoding U_2 from (Y_1, Y_2, U_1) when (Y_1, Y_2) are observations of (X_1, X_2) = (U_1 ⊕ U_2, U_2) through two independent copies of W.
Exercise 1. Show that the binary-input channels W⁻ and W′ are symmetric whenever W is symmetric.
Symmetry of the W channel also allows the output remapping (y_1, y_2, u_1) → (y′_1, y_2) = (u_1 ⊕ y_1, y_2) to transform W′ into an equivalent channel with transition probabilities
Σ_{u_1 ∈ {0,1}} W′((u_1 ⊕ y_1, y_2, u_1) | u_2) = Σ_{u_1 ∈ {0,1}} (1/2) W(u_1 ⊕ y_1 | u_1 ⊕ u_2) W(y_2 | u_2) = W(y_1 | u_2) W(y_2 | u_2).
For general BMS channels, the recursive argument used above for erasure channels applies almost exactly. The only difference is that the resulting channels are not characterized by a single parameter and the channel combining operations are related to general "density evolution" operations rather than BEC arguments. Thus, we find that W̃_1^{(1)} = W and
W_N^{(2i−1)} = W_{N/2}^{(i)−} = W_{N/2}^{(i)} ⊞ W_{N/2}^{(i)}
W_N^{(2i)} = W_{N/2}^{(i)+} = W_{N/2}^{(i)} ⊛ W_{N/2}^{(i)}.    (3)
In practice, it is much more computationally efficient to normalize these channels throughout the recursion. This leads to the observation that
W̃_N^{(2i−1)} = W̃_{N/2}^{(i)} ⊞̃ W̃_{N/2}^{(i)}
W̃_N^{(2i)} = W̃_{N/2}^{(i)} ⊛̃ W̃_{N/2}^{(i)},
where ⊞̃ and ⊛̃ represent the ⊞ and ⊛ operations followed by channel normalization. The operators ⊛̃ and ⊞̃ are known as the bit and check node "density evolution" operators and were originally defined for the analysis of LDPC codes.
Once you have a decoder, the bit-error rates of these effective channels are easily estimated via Monte
Carlo. To do this, one transmits a known vector and uses the decoder to estimate the input bits while
passing genie-aided decisions for the output bits. This is why Algorithm 2 returns noisy hard decisions
for frozen bits. For example, Algorithm 4 uses this method to characterize the effective channels of a
polar code over the BSC.
% Setup parameters
N = 2^n;
f = zeros(1,N);
biterrd = zeros(1,N);
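% A possible continuation of the setup above (a sketch, not the original listing):
% transmit known random inputs over a BSC(p), decode with genie-aided cancellation
% by freezing every position to its true value, and average the per-position
% errors; M (number of trials) and p (crossover probability) are assumed inputs.
for m = 1:M
    u = double(rand(1,N) < 1/2);          % known input vector
    x = polar_transform(u);               % encode
    y = zeros(1,N) + p;                   % P1-domain APP values for a BSC(p)
    y(x==1) = 1 - y(x==1);
    err = rand(1,N) < p;                  % apply the channel errors
    y(err) = 1 - y(err);
    uhat = polar_decode_sketch(y,u);      % genie-aided decisions (decoder sketch above)
    biterrd = biterrd + (uhat ~= u);
end
biterrd = biterrd/M;                      % estimated error rate of each effective channel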
Let A ⊆ {1, 2, ..., N} denote the set of information-bit indices and let δ_N^{(i)} denote the error rate of the effective channel W̃_N^{(i)} under genie-aided decoding. If we let I denote the index of the first incorrectly decoded bit, then we can upper bound the probability of block error by
P_B = P(I ∈ {1, 2, ..., N})
    ≤ Σ_{i=1}^N P(I = i)
    ≤ Σ_{i∈A} P(Û_i ≠ U_i | Û_1^{i−1} = U_1^{i−1})
    ≤ Σ_{i∈A} δ_N^{(i)},
because Ûi = Ui by definition for all i ∈ Ac . Likewise, the bit error probability is upper bounded by
P_b ≤ Σ_{i=1}^N |{j ∈ A | j ≥ i}| P(I = i)
    ≤ Σ_{i∈A} |{j ∈ A | j ≥ i}| P(Û_i ≠ U_i | Û_1^{i−1} = U_1^{i−1})
    ≤ Σ_{i∈A} |{j ∈ A | j ≥ i}| δ_N^{(i)}.
If we want to minimize these upper bounds and send k = |A| information bits, then we can simply choose A to contain the k indices with the smallest values of δ_N^{(i)}.
Once the set of information bits has been chosen, it is easy to compute the generator and parity-check matrices of the code. For a matrix D and a subset B ⊆ N, let us define D_B (resp. D^B) to be the submatrix of D formed by selecting only the columns (resp. rows) of D whose indices are in B. If the frozen bits are all zero (i.e., u_i = 0 for all i ∉ A), then it follows that x = u_A G_N^A, where G_N^A is the |A| × N generator matrix for the polar code. Similarly, a parity-check matrix for the code can be derived from (2) because x G_N = u G_N G_N = u implies that
x [G_N]_{A^c} = u_{A^c} = 0.
Thus, a natural choice for the parity-check matrix H satisfies H^T = [G_N]_{A^c}.
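As a small illustration (assuming the polar_transform function above), one can build G_N by transforming the rows of the identity matrix and then extract G and H for a chosen information set; for n = 3, the set A = {4, 6, 7, 8} is the rate-1/2 choice produced by the BEC(1/2) erasure rates from Algorithm 3:

n = 3; N = 2^n;
GN = zeros(N);
for i = 1:N
    e = zeros(1,N); e(i) = 1;
    GN(i,:) = polar_transform(e);        % row i of G_N
end
A  = [4 6 7 8];                          % information set from the BEC(1/2) design
Ac = setdiff(1:N,A);
G  = GN(A,:);                            % |A| x N generator matrix G_N^A
H  = GN(:,Ac)';                          % parity-check matrix with H^T = [G_N]_{A^c}
assert(all(all(mod(G*H',2) == 0)));      % every row of G is a valid codeword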
For the erasure channel BEC(ε), each effective channel is a BEC with erasure probability ε_N^{(i)} = 2δ_N^{(i)} because randomly breaking ties results in an error rate that is half of the erasure rate. These effective channels are characterized by Algorithm 3. For the BSC, the effective channels are characterized by Algorithm 4 using a Monte Carlo approach. The test program listed as Algorithm 6 combines the design, encoding, and decoding procedures of the polar coding system.
function f = polar_design(biterrd,d)
% Design a polar code based on effective-channel error rates
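% A minimal sketch of the body under an assumed interface: mark the d positions
% with the smallest estimated error rates as information bits (a priori value 1/2)
% and freeze the remaining positions to zero
N = length(biterrd);
[~,order] = sort(biterrd,'ascend');
f = zeros(1,N);
f(order(1:d)) = 1/2;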
3.6 Achieving Capacity
The most exciting thing about polar codes is that they allow the deterministic construction of a code
that achieves capacity on any BMS channel. To show this, Arıkan used an elegant martingale argument
based on the fact that each G2 -butterfly step preserves mutual information. Later, this approach was
simplified to avoid the explicit use of martingales [6]. We use the simplified approach below to show
that the fraction of unpolarized channels converges to 0 as n → ∞.
The following lemma is required for the argument below. A proof is provided at the end of the
section.
Lemma 2. For all δ ∈ (0, 1/2) and any symmetric W satisfying I(W) ∈ [δ, 1 − δ], the quantity
∆(W) ≜ (1/2) [ I(W⁺) − I(W⁻) ]    (4)
satisfies ∆(W)² ≥ κ(δ), where
κ(δ) ≜ min_{h^{−1}(δ) ≤ p ≤ h^{−1}(1−δ)} ( h(2p(1 − p)) − h(p) )²
and h(p) is the binary entropy function (with h^{−1} : [0, 1] → [0, 1/2] its inverse). Also, κ(δ) > 0 for δ ∈ (0, 1/2).
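For example, if W = BEC(ε), then ∆(W) = (1/2)[(1 − ε²) − (1 − ε)²] = ε(1 − ε), which is bounded away from 0 whenever I(W) = 1 − ε ∈ [δ, 1 − δ].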
Let the average (over all channels) of the mutual information after n steps be denoted by μ_n ≜ (1/2^n) Σ_{i=1}^{2^n} I(W_{2^n}^{(i)}). Then,
μ_{n+1} = (1/2^{n+1}) Σ_{i=1}^{2^{n+1}} I(W_{2^{n+1}}^{(i)})
     (a)= (1/2^n) Σ_{i=1}^{2^n} [ (1/2) I(W_{2^{n+1}}^{(2i−1)}) + (1/2) I(W_{2^{n+1}}^{(2i)}) ]
     (b)= (1/2^n) Σ_{i=1}^{2^n} [ (1/2) I(W_{2^n}^{(i)−}) + (1/2) I(W_{2^n}^{(i)+}) ]
     (c)= (1/2^n) Σ_{i=1}^{2^n} I(W_{2^n}^{(i)}) = μ_n,
where (a) breaks the sum into odd/even terms, (b) follows from (3), and (c) follows from
I(W) = (1/2) [ I(W⁺) + I(W⁻) ].
By induction, μ_n = μ_0 = I(W).
Combining with (4), we observe that
I(W)² + ∆(W)² = (1/2) [ I(W⁺)² + I(W⁻)² ].    (5)
Let the average (over all channels) of the squared mutual information after n steps be denoted by ν_n ≜ (1/2^n) Σ_{i=1}^{2^n} I(W_{2^n}^{(i)})². Then,
ν_{n+1} = (1/2^{n+1}) Σ_{i=1}^{2^{n+1}} I(W_{2^{n+1}}^{(i)})²
     (a)= (1/2^n) Σ_{i=1}^{2^n} [ (1/2) I(W_{2^{n+1}}^{(2i−1)})² + (1/2) I(W_{2^{n+1}}^{(2i)})² ]
     (b)= (1/2^n) Σ_{i=1}^{2^n} [ (1/2) I(W_{2^n}^{(i)+})² + (1/2) I(W_{2^n}^{(i)−})² ]
     (c)= (1/2^n) Σ_{i=1}^{2^n} [ I(W_{2^n}^{(i)})² + ∆(W_{2^n}^{(i)})² ]
        = ν_n + (1/2^n) Σ_{i=1}^{2^n} ∆(W_{2^n}^{(i)})²,    (6)
where (a) breaks the sum into odd/even terms, (b) follows from (3), and (c) follows from (5). From this,
we see that νn+1 ≥ νn . Since νn ∈ [0, 1] by definition, this implies that νn converges to a limit and,
hence, that νn+1 − νn → 0.
Now, we define the fraction of δ-unpolarized channels after n steps to be
θ_n(δ) ≜ (1/2^n) | { i ∈ [2^n] : I(W_{2^n}^{(i)}) ∈ [δ, 1 − δ] } |.
Using Lemma 2, it follows that
(1/2^n) Σ_{i=1}^{2^n} ∆(W_{2^n}^{(i)})² ≥ θ_n(δ) κ(δ).
Combining this with (6) gives θ_n(δ) ≤ (ν_{n+1} − ν_n)/κ(δ) → 0; thus, the fraction of δ-unpolarized channels converges to 0 as n → ∞ for every δ ∈ (0, 1/2).
For the proof of Lemma 2, note that since 0 < p < 2p(1 − p) ≤ 1/2 for p ∈ (0, 1/2) and h(p) is strictly increasing on the same set, it follows that ∆(W) > 0 as long as p ∈ (0, 1/2). For I(W) ∈ [δ, 1 − δ], one can minimize this bound over this range to see that ∆(W)² ≥ κ(δ). Finally, we note that κ(δ) > 0 for δ ∈ (0, 1/2) because κ(δ) > 0 as long as 0 < h^{−1}(δ) and h^{−1}(1 − δ) < 1/2.
A.1 Entropy and Mutual Information
Definition 3. The entropy (in bits) of a discrete random variable X with probability distribution p(x)
is denoted
H(X) ≜ Σ_{x∈X} p(x) log2 ( 1 / p(x) ) = E[ log2 ( 1 / p(X) ) ],
where 0 log2 0 = 0 by continuity. The notation H(p) is used to denote H (X) when X ∼ p(x). When
there is no ambiguity, H will be used instead of H (X). The unit of entropy is determined by the base
of the logarithm with base-2 resulting in “bits” and the natural log (i.e., base-e) resulting in “nats”.
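For example, a Bernoulli(p) random variable has H(X) = h(p) = p log2(1/p) + (1 − p) log2(1/(1 − p)); a fair coin gives h(1/2) = 1 bit, while h(0.11) ≈ 0.5 bits.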
Remark 4. Roughly speaking, the entropy H (X) measures the uncertainty in the random variable X.
From a compression point of view, it equals the minimum average bit rate (in bits per symbol) to which
one can compress long sequences X1 , X2 , . . . drawn i.i.d. according to p(x). It also equals the exponential
growth rate, for i.i.d. sequences, of the smallest set that contains almost all the probability. For any
δ ∈ (0, 1), one can show that
H(X) = lim_{N→∞}  min_{ S ⊆ X^N : Σ_{(x_1,...,x_N)∈S} Π_{i=1}^N p(x_i) > 1−δ }  (1/N) log2 |S|.
Definition 5. The joint entropy (in bits) of a pair of r.v. (X, Y ) ∼ pX,Y (x, y) is denoted
H(X, Y) ≜ Σ_{(x,y)∈X×Y} p_{X,Y}(x, y) log2 ( 1 / p_{X,Y}(x, y) ) = E[ log2 ( 1 / p_{X,Y}(X, Y) ) ].
Definition 6. For a pair of r.v. (X, Y) ∼ p_{X,Y}(x, y), the conditional entropy (in bits) of Y given X is denoted
H(Y | X) ≜ Σ_{(x,y)∈X×Y} p_{X,Y}(x, y) log2 ( 1 / p_{Y|X}(y|x) ) = E[ log2 ( 1 / p_{Y|X}(Y|X) ) ].
Notice that this equals the entropy of the conditional distribution p_{Y|X}(y|x) averaged over x.
Definition 7. For a pair of r.v. (X, Y ) ∼ pX,Y (x, y), the mutual information (in bits) between X
and Y is denoted
I(X; Y) ≜ Σ_{(x,y)∈X×Y} p_{X,Y}(x, y) log2 [ p_{X,Y}(x, y) / ( p_X(x) p_Y(y) ) ] = E[ log2 ( p_{X,Y}(X, Y) / ( p_X(X) p_Y(Y) ) ) ].
The previous definitions of entropy and mutual information all extend naturally to any finite number
of random variables by treating multiple random variables as a single random vector. However, there
are some new concepts that can only be defined in terms of 3 random variables. Let X, Y , and Z be
random variables with joint distribution pX,Y,Z (x, y, z).
Definition 10. For three r.v. (X, Y, Z) ∼ p_{X,Y,Z}(x, y, z) defined on X × Y × Z, the conditional mutual information (in bits) between X and Y given Z is denoted
I(X; Y | Z) ≜ Σ_{(x,y,z)} p_{X,Y,Z}(x, y, z) log2 [ p_{X,Y|Z}(x, y|z) / ( p_{X|Z}(x|z) p_{Y|Z}(y|z) ) ] = E[ log2 ( p_{X,Y|Z}(X, Y|Z) / ( p_{X|Z}(X|Z) p_{Y|Z}(Y|Z) ) ) ].
From this, we see that I(X; Y | Z) = H(X | Z) + H(Y | Z) − H(X, Y | Z). Thus, the conditioning is simply inherited by each entropy in the standard decomposition.
Lemma 11. The chain rule of mutual information states that I(X; Y, Z) = I(X; Y) + I(X; Z | Y).
Proof. This follows from the expectation of the decomposition
log2 [ p_{X,Y,Z}(x, y, z) / ( p_X(x) p_{Y,Z}(y, z) ) ] = log2 [ p_{X,Y}(x, y) / ( p_X(x) p_Y(y) ) ] + log2 [ p_{X,Z|Y}(x, z|y) / ( p_{X|Y}(x|y) p_{Z|Y}(z|y) ) ].
Definition 13. A DMC with binary inputs is called symmetric if there is a permutation π : Y → Y
satisfying W (π(y)|1) = W (y|0) and π(π(y)) = y for all y ∈ Y.
Example 14. For example, the binary symmetric channel (BSC) with error probability ρ has X = Y = {0, 1} and is defined by W(y | x) = 1 − ρ if y = x and W(y | x) = ρ if y ≠ x.
Example 15. For example, the binary erasure channel (BEC) with erasure probability ε has X = {0, 1}, Y = {0, 1, ∗}, and is defined by W(x | x) = 1 − ε, W(∗ | x) = ε, and W(y | x) = 0 otherwise.
Channel coding is the process of improving performance by adding redundancy (e.g., by encoding a K-bit message into N > K bits).
Definition 16. For a code/decoder pair, the block error probability PB is the probability that the
decoder does not map the received sequence to the transmitted codeword. A code rate R is achievable
if there exists a sequence of encoder/decoder pairs with rate RN → R and block error rate PB,N → 0.
The channel capacity is the supremum of all achievable code rates.
Remark 17. One can get a qualitative feel for achievable rates via the following argument. The key is that, for i.i.d. sequences (X_1, ..., X_N) with large N, the probability distribution essentially becomes uniform over a set of 2^{NH(X)} "typical" sequences. Thus, for (X, Y) ∼ p(x)W(y|x), the i.i.d. sequence ((X_1, Y_1), ..., (X_N, Y_N)) takes one of 2^{NH(X,Y)} different typical values with essentially uniform probability. If we ignore the X values, then the number of (Y_1, ..., Y_N) typical sequences is roughly 2^{NH(Y)}. If we fix the (X_1, ..., X_N) sequence to a typical value (x_1, ..., x_N), then the number of ((x_1, Y_1), ..., (x_N, Y_N)) typical sequences is roughly 2^{NH(Y|X)}. This last set of sequences can be seen as the likely set of output sequences if x is transmitted. Thus, if the likely output sets of each codeword fill the space but do not overlap, then we get 2^{NR} · 2^{NH(Y|X)} = 2^{NH(Y)} or R = H(Y) − H(Y|X) = I(X; Y).
Theorem 18 (Channel Coding Theorem). For a DMC, the channel capacity is given by
C ≜ max_{p(x)} I(X; Y) = max_{p(x)} Σ_{(x,y)∈X×Y} p(x) W(y|x) log2 [ W(y|x) / Σ_{x'} p(x') W(y|x') ].
Thus, for any R ≤ C, there exists a sequence of encoder/decoder pairs such that RN → R and PB,N → 0.
Conversely, if a sequence of encoder/decoder pairs satisfies RN → R and PB,N → 0, then R ≤ C.
Remark 19. For a symmetric DMC, the capacity C is achieved by the uniform input distribution p(x) =
1/ |X |.
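For example, the uniform input distribution gives C = 1 − h(ρ) for the BSC(ρ) and C = 1 − ε for the BEC(ε).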
References
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric
binary-input memoryless channels,” IEEE Trans. Inform. Theory, vol. 55, pp. 3051–3073, July 2009.
[2] E. Arikan, “An upper bound on the cutoff rate of sequential decoding,” IEEE Trans. Inform. Theory,
vol. 34, pp. 55–63, Jan. 1988.
[3] S. H. Hassani, K. Alishahi, and R. Urbanke, “Finite-length scaling for polar codes,” IEEE Trans.
Inform. Theory, vol. 60, no. 10, pp. 5875–5898, 2014.
[4] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inform. Theory, vol. 61, no. 5,
pp. 2213–2226, 2015.
[5] G. H. Golub and C. F. Van Loan, Matrix computations. Johns Hopkins University Press, 3rd ed.,
2012.
[6] M. Alsan and E. Telatar, “A simple proof of polarization and polarization for non-stationary memo-
ryless channels,” IEEE Trans. Inform. Theory, vol. 62, no. 9, pp. 4873–4878, 2016.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley Series in Telecommunications,
Wiley, 2nd. ed., 2006.
Algorithm 6 Main Test Program for Polar Coding System
% Transmit
if (bec)
y = x;
y(rand(1,N)<e)=1/2;
else
y = zeros(1,N)+p;
y(x==1) = 1-y(x==1);
err = rand(1,N)<p;
y(err) = 1-y(err);
end