A Brief Introduction to Polar Codes

Notes for Introduction to Error-Correcting Codes


Henry D. Pfister
October 8th, 2017

1 Introduction
Polar codes were introduced by Erdal Arıkan in 2009 and they provide the first deterministic construc-
tion of capacity-achieving codes for binary memoryless symmetric (BMS) channels [1]. They are the
culmination of Arıkan’s research into the computational cutoff rate of sequential decoding.
In particular, consider a setup where (U_1, U_2) are two equiprobable bits that are encoded into (X_1, X_2) = (U_1 ⊕ U_2, U_2). Then, (X_1, X_2) are mapped to (Y_1, Y_2) by two independent BMS channels with transition probabilities P(Y_1 = y | X_1 = x) = P(Y_2 = y | X_2 = x) = W(y|x). The Tanner graph for this setup is shown in Figure 1. Since the mapping from (U_1, U_2) to (X_1, X_2) is invertible, one finds that

I(U_1, U_2; Y_1, Y_2) = I(X_1, X_2; Y_1, Y_2) = I(X_1; Y_1) + I(X_2; Y_2) = 2I(W),

where C = I(X_1; Y_1) = I(W) is the capacity of a symmetric BMS channel because X_1 is equiprobable and

I(W) ≜ Σ_y W(y|0) log_2 [ W(y|0) / (½W(y|0) + ½W(y|1)) ] = I(X_1; Y_1).

[Figure 1: Factor graph of the basic transform: (u_1, u_2) is mapped to (x_1, x_2) = (u_1 ⊕ u_2, u_2), and x_1, x_2 are sent through two independent copies of W to produce y_1, y_2.]
Thus, the transformation (X1 , X2 ) = (U1 ⊕ U2 , U2 ) preserves the sum capacity of the system. Readers
unfamiliar with information theory should consult the primer in Appendix A.
Also, the chain rule decomposition
I(U1 , U2 ; Y1 , Y2 ) = I(U1 ; Y1 , Y2 ) + I(U2 ; Y1 , Y2 |U1 ) = 2I(W )
implies that one can also achieve the rate 2I(W ) using two steps. First, information is transmitted
through the first virtual channel W − : U1 → (Y1 , Y2 ). By coding over multiple channel uses, one can
achieve any rate up to I(U1 ; Y1 , Y2 ). If all the U1 ’s are known from the first stage, then one can decode
the information transmitted through the second virtual channel W + : U2 → (Y1 , Y2 , U1 ). Again, coding
allows one to achieve any rate up to I(U2 ; Y1 , Y2 |U1 ).
If one chooses convolutional codes with sequential decoding, however, the expected decoding-complexity
per bit becomes unbounded for rates above the computational cutoff rate
R0 (W ) = 1 − log2 (1 + Z(W )),
where Z(W) ≜ Σ_y √( W(y|0) W(y|1) ) is the Bhattacharyya parameter of the channel [2]. Arıkan's key
observation was that, while the sum capacity of the two virtual channels is preserved, the sum cutoff
rate satisfies
R0 (W + ) + R0 (W − ) ≥ 2R0 (W ),
with equality iff R0 (W ) ∈ {0, 1}. Thus, repeated transformations of this type cause the implied virtual
channels to polarize into extremal channels whose capacities approach 0 or 1. But, for these extremal
channels, coding is trivial and one either sends an information bit or a dummy bit.
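
As a quick numerical illustration (added here; it is not part of the original notes), one can check the cutoff-rate gain of a single polarization step for the erasure channel. The sketch below assumes two facts: Z(BEC(ε)) = ε, and the BEC recursion of Section 3.3, which gives virtual channels BEC(2ε − ε²) and BEC(ε²).

% Sketch (assumed illustration, not from the notes): cutoff-rate gain of one
% polarization step for W = BEC(ep), using Z(BEC(ep)) = ep and the Section 3.3
% recursion ep- = 2*ep - ep^2 and ep+ = ep^2.
R0 = @(z) 1 - log2(1 + z);     % cutoff rate as a function of Z(W)
ep = 0.3;
gain = R0(2*ep - ep^2) + R0(ep^2) - 2*R0(ep)   % positive (about 0.038)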
From a theoretical point of view, polar codes are beautifully simple. Practically, they approach
capacity rather slowly as their blocklength increases [3]. Still, based on recent advances [4], they have
been chosen as the short-block codes for the 5G standard.

2 Mathematical Background
2.1 Kronecker Products
For a positive integer n, let N = 2^n and let (d_0 d_1 ... d_{n−1})_2 ≜ 1 + Σ_{i=0}^{n−1} d_i 2^i denote the integer (indexed from 1) with binary digit string d_0 d_1 ... d_{n−1}. Much of the following material can be found in [5]. For matrices A = {a_{i,j}} and B = {b_{i,j}}, let A ⊗ B represent the Kronecker product

A ⊗ B ≜ [ a_{1,1}B   a_{1,2}B   ···
          a_{2,1}B   a_{2,2}B   ···
             ⋮          ⋮        ⋱  ].

Kronecker powers are defined by A^{⊗n} ≜ A ⊗ A^{⊗(n−1)} = A^{⊗(n−1)} ⊗ A. The following well-known identity will also be useful:

(A ⊗ B)(C ⊗ D) = AC ⊗ BD.
An N × N permutation matrix P_N satisfies P_N^T P_N = I_N, where I_N is the N × N identity matrix, because it preserves the Euclidean distance between pairs of vectors and is therefore orthogonal. An important permutation matrix is the "mod-p perfect shuffle" S_{p,q} with N = pq. Like all permutation matrices, it is defined by where it maps the unit row vectors e_m for m = 1, 2, ..., N. In particular, S_{p,q} is defined by e_{m'}^T = S_{p,q} e_m^T iff m' = ⌊(m − 1)/p⌋ + 1 + ((m − 1) mod p) q. For example, we have

S_{2,M/2} (s_1, s_2, ..., s_M)^T = (s_1, s_3, ..., s_{M−1}, s_2, s_4, ..., s_M)^T
S_{3,M/3} (s_1, s_2, ..., s_M)^T = (s_1, s_4, ..., s_{M−2}, s_2, s_5, ..., s_{M−1}, s_3, s_6, ..., s_M)^T.

Let A and B be m_1 × n_1 and m_2 × n_2 matrices, respectively. Then, the mod-p perfect shuffle allows one to write the identity

B ⊗ A = S_{m_1,m_2}^T (A ⊗ B) S_{n_1,n_2}.
The basic idea is that A ⊗ B differs from B ⊗ A only in the order of the rows and columns and mod-p
shuffle matrices can be used to rearrange the rows and columns of A ⊗ B into those of B ⊗ A. In
particular, if A is a 2 × 2 matrix and B is an (N/2) × (N/2) matrix, then

B ⊗ A = S_{2,N/2}^T (A ⊗ B) S_{2,N/2}
      = R_N (A ⊗ B) R_N^T,                                    (1)

where R_N = S_{2,N/2}^T denotes the N × N reverse shuffle permutation matrix defined by

(s_1, s_2, ..., s_N) R_N = (s_1, s_3, ..., s_{N−1}, s_2, s_4, ..., s_N).

It is easy to verify that R_N = S_{2,N/2}^T by transposing the previous equation.
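
Because these permutation identities are easy to get wrong, a short numerical check can be reassuring. The sketch below is an added check (not part of the notes): it builds R_N explicitly and verifies both the reverse-shuffle action and identity (1).

% Sketch (assumed check): construct R_N = S_{2,N/2}^T and test identity (1).
N = 8;
q = [1:2:N, 2:2:N];                        % odd positions first, then even
E = eye(N);
R = E(:, q);                               % reverse shuffle: s*R = s(q)
(1:N) * R                                  % returns [1 3 5 7 2 4 6 8]
A = randi([0 9], 2, 2);
B = randi([0 9], N/2, N/2);
isequal(kron(B, A), R * kron(A, B) * R')   % identity (1); prints 1 (true)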
Looking closely, we find that if m = (d_0 d_1 ... d_{n−1})_2 and e_{m'} = e_m R_N, then m' = (d_1 d_2 ... d_{n−1} d_0)_2. Therefore, R_N reorders the vector so that the base-2 expansion of the index experiences a left circular shift. This maps the least significant bit (LSB) of the index to the most significant bit (MSB) of the index and preserves the order of the other n − 1 bits of the index. Now, the N × N bit-reversal permutation matrix B_N can be defined recursively in terms of R_N using

B_N = R_N [ B_{N/2}      0
               0      B_{N/2} ] = R_N (I_2 ⊗ B_{N/2}).

This holds because applying R_N reorders the vector so that the LSB of the index becomes the MSB of the index. Then, multiplying by I_2 ⊗ B_{N/2} separately reorders the first and last halves of the vector according to bit reversals of length N/2. Since B_N reverses the bit indexing, it is symmetric and satisfies B_N^T = B_N. Thus, we also find that

B_N = B_N^T = (I_2 ⊗ B_{N/2})^T R_N^T = (I_2 ⊗ B_{N/2}^T) R_N^T = (I_2 ⊗ B_{N/2}) R_N^T.
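
The recursion above also gives a direct way to construct B_N numerically. The following sketch is an added check (not in the notes): it builds B_N iteratively and confirms that it is symmetric and that it bit-reverses the (1-indexed) binary index.

% Sketch (assumed check): build B_N from B_N = R_N (I_2 kron B_{N/2}).
n = 3; N = 2^n;
B = 1;
for m = 2.^(1:n)                           % m = 2, 4, ..., N
    q = [1:2:m, 2:2:m];
    E = eye(m); R = E(:, q);               % reverse shuffle R_m
    B = R * kron(eye(2), B);               % B_m
end
isequal(B, B')                             % B_N is symmetric; prints 1 (true)
(1:N) * B                                  % returns [1 5 3 7 2 6 4 8] for N = 8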

2.2 Soft Decoding
For simplicity, we will also assume that channel outputs are normalized into a posteriori probability (APP) estimates that are either log-likelihood ratios (LLRs) or "probability of 1" (P1) values. For example, an LLR-normalized channel output must satisfy

y_1 = ln [ P(Y_1 = y_1 | X_1 = 0) / P(Y_1 = y_1 | X_1 = 1) ],

while a P1-normalized channel output must satisfy

y_1 = P(X_1 = 1 | Y_1 = y_1) = P(Y_1 = y_1 | X_1 = 1) / [ P(Y_1 = y_1 | X_1 = 0) + P(Y_1 = y_1 | X_1 = 1) ].

Using this, the bit-node and check-node operations used to combine estimates in LDPC decoding are defined, respectively, by

y_1 ⊛ y_2 = { y_1 + y_2                                    if LLR domain
            { y_1 y_2 / [ y_1 y_2 + (1 − y_1)(1 − y_2) ]   if P1 domain

and

y_1 ⊞ y_2 = { 2 tanh^{−1}( tanh(y_1/2) tanh(y_2/2) )       if LLR domain
            { y_1 (1 − y_2) + y_2 (1 − y_1)                if P1 domain.

For convenience, the operation ⊞ is also defined between APP values and hard bit values so that channel symmetry can be represented by W(y|1) = W(1 ⊞ y|0). For a, b ∈ {0, 1}, we also assume that 0 ⊞ y = y, a ⊞ y = y ⊞ a, and a ⊞ (b ⊞ y) = (a ⊕ b) ⊞ y.

As an example of these operations, consider the (X_1, X_2) = (U_1 ⊕ U_2, U_2) mapping defined in the introduction. Let (y_1, y_2) be a pair of normalized observations of (X_1, X_2). Then, ũ_1 = y_1 ⊞ y_2 is the optimal normalized bit estimate of U_1 given (Y_1, Y_2). Using this, ũ_2 = (u_1 ⊞ y_1) ⊛ y_2 is the optimal normalized bit estimate of U_2 given (Y_1, Y_2, U_1).
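
As a sanity check on these two domains (added here; not part of the notes), the script below implements both operations as anonymous functions and confirms that the LLR-domain and P1-domain versions agree after converting an LLR l to the P1 value 1/(1 + e^l).

% Sketch (assumed check): LLR-domain and P1-domain combining rules agree.
llr2p1 = @(l) 1 ./ (1 + exp(l));                   % P(X=1|Y) from an LLR
vn_llr = @(a,b) a + b;                             % bit-node, LLR domain
cn_llr = @(a,b) 2*atanh(tanh(a/2).*tanh(b/2));     % check-node, LLR domain
vn_p1  = @(a,b) a.*b ./ (a.*b + (1-a).*(1-b));     % bit-node, P1 domain
cn_p1  = @(a,b) a.*(1-b) + b.*(1-a);               % check-node, P1 domain
l1 = 1.3; l2 = -0.4;
[llr2p1(vn_llr(l1,l2)), vn_p1(llr2p1(l1), llr2p1(l2))]   % both about 0.289
[llr2p1(cn_llr(l1,l2)), cn_p1(llr2p1(l1), llr2p1(l2))]   % both about 0.556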

3 The Polar Transformation


3.1 Algebraic and Factor-Graph Description
The polar transform of size N is defined to be

G_N ≜ B_N [ 1 0
            1 1 ]^{⊗n} = B_N G_2^{⊗n}.

Since B_2 = I_2, one finds that

G_2 = [ 1 0
        1 1 ]

and this transform is a fundamental building block (e.g., see Figure 1) in Arıkan's construction of polar codes. The main idea is to construct a message vector u where the elements u_i with i ∈ A ⊆ {1, 2, ..., N} carry information and the other elements u_j with j ∈ A^c contain values known at the transmitter and receiver (e.g., are fixed to 0's). Then, the codeword x = u G_N is transmitted over the channel.
Various recursive definitions of G_N will also be useful. For example, the recursion B_N = R_N (I_2 ⊗ B_{N/2}) implies that

G_N = R_N (I_2 ⊗ B_{N/2}) G_2^{⊗n}
    = R_N (I_2 ⊗ B_{N/2}) (G_2 ⊗ G_2^{⊗(n−1)})
    = R_N (G_2 ⊗ B_{N/2} G_2^{⊗(n−1)})
    = R_N (G_2 ⊗ G_{N/2})
    = R_N (G_2 ⊗ I_{N/2})(I_2 ⊗ G_{N/2}).

[Figure 2: Factor graph associated with the G_N = (I_{N/2} ⊗ G_2) R_N (I_2 ⊗ G_{N/2}) decomposition: u · (I_{N/2} ⊗ G_2) · R_N · (I_2 ⊗ G_{N/2}) = x.]


With this result, we see that computing x = u G_N is equivalent to a reverse shuffle of u followed by an expanded G_2-butterfly section and then separate polar transforms of length N/2 applied to the first and second halves of the vector. From (1), we know R_N (G_2 ⊗ I_{N/2}) R_N^T = (I_{N/2} ⊗ G_2) and this implies that

G_N = R_N (G_2 ⊗ I_{N/2})(I_2 ⊗ G_{N/2})
    = (I_{N/2} ⊗ G_2) R_N (I_2 ⊗ G_{N/2}).

The factor graph associated with this decomposition is shown in Figure 2. From this, we see that computing x = u G_N using this formula is equivalent to a local G_2-butterfly section followed by a reverse shuffle and then separate polar transforms of length N/2 applied to the first and second halves of the vector. This representation leads naturally to the recursive encoding algorithm implied by [1] and implemented below in Algorithm 1.

Algorithm 1 Recursive Implementation of Polar Transform in Matlab

function x = polar_transform(u)
% Recurse down to length 1
if (length(u)==1)
    x = u;
else
    % Compute odd/even outputs of (I_{N/2} \otimes G_2) transform
    u1u2 = mod(u(1:2:end)+u(2:2:end),2);
    u2 = u(2:2:end);
    % R_N maps odd/even indices (i.e., u1u2/u2) to first/second half
    x = [polar_transform(u1u2) polar_transform(u2)];
end
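
One can also check Algorithm 1 against the algebraic definition G_N = B_N G_2^{⊗n}. The sketch below (added here; not part of the notes) builds B_N and G_2^{⊗n} directly and compares the two encodings; it assumes polar_transform from Algorithm 1 is saved as a function on the MATLAB path.

% Sketch (assumed check): recursive transform vs direct x = u*B_N*G_2^{kron n} mod 2.
n = 3; N = 2^n;
G2 = [1 0; 1 1];
G = 1; B = 1;
for m = 2.^(1:n)
    G = kron(G2, G);                       % G_2^{kron log2(m)}
    q = [1:2:m, 2:2:m];
    E = eye(m); R = E(:, q);
    B = R * kron(eye(2), B);               % B_m
end
u = double(rand(1, N) < 0.5);              % random message
isequal(mod(u * B * G, 2), polar_transform(u))   % prints 1 (true)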

[Figure 3: Factor graph associated with the G_N = (I_2 ⊗ G_{N/2}) R_N^T (I_{N/2} ⊗ G_2) decomposition: u · (I_2 ⊗ G_{N/2}) · R_N^T · (I_{N/2} ⊗ G_2) = x.]

Next, we show that B_N commutes with G_2^{⊗n} and, hence, G_N = B_N G_2^{⊗n} = G_2^{⊗n} B_N. Since B_N B_N = I_N, we use induction to show that G_N B_N = B_N G_2^{⊗n} B_N = G_2^{⊗n}. For N = 2, this is trivial because B_2 = I_2. For the inductive step, we write

G_N B_N = B_N G_2^{⊗n} B_N
        = R_N (I_2 ⊗ B_{N/2}) G_2^{⊗n} (I_2 ⊗ B_{N/2}) R_N^T
        = (B_{N/2} ⊗ I_2) G_2^{⊗n} (B_{N/2} ⊗ I_2) R_N R_N^T
        = (B_{N/2} ⊗ I_2)(G_2^{⊗(n−1)} ⊗ G_2)(B_{N/2} ⊗ I_2)
        = ((B_{N/2} G_2^{⊗(n−1)}) ⊗ G_2)(B_{N/2} ⊗ I_2)
        = (B_{N/2} G_2^{⊗(n−1)} B_{N/2}) ⊗ G_2
        = G_2^{⊗(n−1)} ⊗ G_2 = G_2^{⊗n},

where the third line uses (1) to move R_N to the right (i.e., R_N (I_2 ⊗ B_{N/2}) = (B_{N/2} ⊗ I_2) R_N and R_N G_2^{⊗n} = G_2^{⊗n} R_N) and the last line uses the induction hypothesis B_{N/2} G_2^{⊗(n−1)} B_{N/2} = G_2^{⊗(n−1)}.

It is worth noting that this implies that G_N = G_N^{−1} because

G_N G_N = G_N B_N G_2^{⊗n} = G_2^{⊗n} G_2^{⊗n} = (G_2 G_2)^{⊗n} = I_2^{⊗n} = I_N.     (2)

Using G_N = G_2^{⊗n} B_N and B_N = (I_2 ⊗ B_{N/2}) R_N^T, one can also get the alternative recursive decomposition

G_N = G_2^{⊗n} (I_2 ⊗ B_{N/2}) R_N^T
    = (G_2 ⊗ G_2^{⊗(n−1)})(I_2 ⊗ B_{N/2}) R_N^T
    = (G_2 ⊗ (G_2^{⊗(n−1)} B_{N/2})) R_N^T
    = (G_2 ⊗ G_{N/2}) R_N^T
    = (I_2 ⊗ G_{N/2})(G_2 ⊗ I_{N/2}) R_N^T
    = (I_2 ⊗ G_{N/2}) R_N^T (I_{N/2} ⊗ G_2).

The factor graph associated with this decomposition is shown in Figure 3. From this, we see that this
decomposition is quite natural for a recursive successive-cancellation decoding algorithm because one can

first perform a "soft-inversion" of R_N^T (I_{N/2} ⊗ G_2) using LDPC code message-passing operations and then recursively call the polar decoding algorithm for two length-N/2 polar transforms defined by I_2 ⊗ G_{N/2}. We note that the arguments to the polar decoding algorithm are APP estimates of the bits in the output vector and a priori probability estimates for the bits in the input vector (e.g., the a priori estimates identify frozen bits and their values). The output from the polar decoder is a vector of hard-decision bit estimates for the input and output bits of the polar code. The above decomposition leads naturally to the recursive decoding algorithm defined below in Algorithm 2.

Algorithm 2 Recursive Implementation of P1-Domain Polar Decoder in Matlab

function [u,x] = polar_decode(y,f)
% y = bit APPs from channel in output order
% f = input a priori probs in input order
% x = output hard decisions in output order
% u = input hard decisions in input order

% Recurse down to length 1
N = length(y);
if (N==1)
    if (f==1/2)
        % If info bit, make hard decision based on observation
        x = round(y); u = x;
    else
        % Use frozen bit for output and hard decision for input (for Monte Carlo design)
        x = f; u = round(y);
    end
else
    % Compute soft mapping back one stage
    u1est = cnop(y(1:2:end),y(2:2:end));

    % R_N^T maps u1est to top polar code
    [uhat1,u1hardprev] = polar_decode(u1est,f(1:(N/2)));

    % Using u1hardprev and y, we can estimate u2
    u2est = vnop(cnop(u1hardprev,y(1:2:end)),y(2:2:end));

    % R_N^T maps u2est to bottom polar code
    [uhat2,u2hardprev] = polar_decode(u2est,f((N/2+1):end));

    % Tunnel u decisions back up. Compute and interleave x1,x2 hard decisions
    u = [uhat1 uhat2];
    x = reshape([cnop(u1hardprev,u2hardprev); u2hardprev],1,[]);
end
return

% Check-node operation in P1 domain
function z=cnop(w1,w2)
z = w1.*(1-w2) + w2.*(1-w1);
return

% Bit-node operation in P1 domain
function z=vnop(w1,w2)
z = w1.*w2 ./ (w1.*w2 + (1-w1).*(1-w2));
return
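
A minimal end-to-end usage example (added here; it is not part of the notes) is the following: encode one length-8 block with Algorithm 1, erase outputs at random, and decode with Algorithm 2 in the P1 domain. The information set {5, 6, 7, 8} is chosen only for illustration; Section 3.5 explains how to choose it properly.

% Sketch usage (assumed): one length-8 block over BEC(0.1) in the P1 domain.
N = 8; e = 0.1;
f = zeros(1, N); f(5:8) = 1/2;             % 1/2 marks an information position
u = f;  u(5:8) = rand(1,4) < 0.5;          % random data; frozen bits stay 0
x = polar_transform(u);                    % encode with Algorithm 1
y = x;  y(rand(1,N) < e) = 1/2;            % erased outputs get the P1 value 1/2
[uhat, xhat] = polar_decode(y, f);         % successive-cancellation decoding
sum(uhat(5:8) ~= u(5:8))                   % number of information-bit errors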

3.2 Decoding Analysis and Code Design
Consider the recursive successive cancellation decoder for a polar transform of length N and let y =
(y1 , . . . , yN ) be the observations of the output bits x = (x1 , . . . , xN ) through N copies of the BMS
channel W . To describe the decoder analysis, we focus on the effective channels seen by each of the
(i)
inputs bits in u = (u1 , . . . , uN ). Let WN represent the virtual channel seen by ui during recursive
successive-cancellation decoding of a length-N polar transform. The SCD process uses the entire y1N
vector and all past decisions ûi−1 1 to generate the soft estimate ũi and hard decision ûi for bit i. Assuming
all past decisions are correct (i.e., ûi−11 = ui−1 N i−1
1 ), the effective channel maps ui ∈ {0, 1} to (y1 , u1 ) ∈
N i−1
Y × {0, 1} and is defined to be

(i)
X 1
(y1N , ui−1 WN (y1N |uN

WN 1 )|ui , 1 GN ),
2N −1
uN
i+1 ∈{0,1}
N −i

QN
where WN (y1N |xN1 ) =
N
i=1 W (yi |xi ). This formula represents an experiment where u1 is an i.i.d.
N N
equiprobable vector encoded into x1 = u1 GN and then transmitted through N independent W channels.
The decoder receives the channel observations y1N and the side information ui−1 1 . In the formula, one
can interpret the sum over uN i+1 as marginalizing out the “future” input bits that are unknown to the
decoder.
(i)
Since the output space of the channel WN depends on N , it is also useful to define the normalized
(i) (i)
channel W̃N (z|ui ), for z ∈ Z, as the output remapping of WN defined by
( P(Y N =yn ,U i−1 =ui−1 ) | U =0)
ln P(Y1N =y1n ,U1i−1 =u1i−1 ) | Ui =1) for the LLR domain
(y1N , ui−1
1 ) →z= 1 1 1 1 i

P(Ui = 1 | Y1N = y1n , U1i−1 = ui−1


1 ) for the P1 domain,

which is a sufficient statistic for Ui given (Y1N , U1i−1 ). Based on this, the transition probability satisfies
(i) (i)
X
WN (y1N , u1i−1 )|ui ,

W̃N (z|ui ) =
(y1N ,ui−1
1 )∈S(z)

where
P(Y1N =y1n ,U1i−1 =ui−1
(n o
) | Ui =0)
(y1N , ui−1
1 ) ∈ Y N
× {0, 1}i−1
z = ln 1
P(Y1N =y1n ,U1i−1 =ui−1 ) | Ui =1)
for the LLR domain
S(z) =  1
i−1 i−1 i−1

(y1N , u1 ) ∈ Y N × {0, 1}i−1 z = P(Ui = 1 | Y1N = y1n , U1 = u1 ) for the P1 domain

In this case, the output space Z stays fixed for all N and is either Z = R (for the LLR domain) or
(2i−1) (2i)
Z = [0, 1] (for the P1 domain). Later, we will see how to recursively characterize W̃N and W̃N in
(i)
terms of W̃N/2 .

3.3 The Erasure Channel


To understand the recursive formula for W̃_N^{(i)}, we focus first on the case of W = BEC(ε). This will allow us to leverage the simplicity of the erasure channel. Using this, it is easy to verify from Figure 1 that the soft estimate ũ_1 is an erasure unless both y_1 and y_2 are received correctly through their respective channels. This implies that W̃_2^{(1)} = BEC(1 − (1 − ε)^2). Likewise, if the decoder is given knowledge of u_1, then y_1 and y_2 act as independent observations of u_2 and we find that W̃_2^{(2)} = BEC(ε^2). One can easily verify that these results are generic and hold for message-passing decoding on the factor graph, MAP decoding on the factor graph, and the recursive successive cancellation decoder described in Algorithm 2.

To characterize the normalized virtual channels for larger N, we use induction based on the decomposition shown in Figure 4. Let u' = (u'_1, ..., u'_{N/2}) denote the input bits to the upper G_{N/2} block and u'' = (u''_1, ..., u''_{N/2}) denote the input bits to the lower G_{N/2} block. In this figure, we see that the input bits u_1, u_2 are mapped through the G_2 polar transform to u'_1 = u_1 ⊕ u_2 and u''_1 = u_2. The bits u'_1 and u''_1 are effectively transmitted through two independent copies of the normalized virtual channel W̃_{N/2}^{(1)}.

[Figure 4: Factor graph associated with G_N = R_N (G_2 ⊗ I_{N/2})(I_2 ⊗ G_{N/2}): u R_N · (G_2 ⊗ I_{N/2}) · (I_2 ⊗ G_{N/2}) = x. The odd-indexed inputs u_1, u_3, ..., u_{N−1} enter the upper G_{N/2} block and the even-indexed inputs u_2, u_4, ..., u_N enter the lower G_{N/2} block.]

Before making a hard decision, the successive-cancellation decoder combines the APP estimates ũ'_1 and ũ''_1 with the check-node operation to get ũ_1 = ũ'_1 ⊞ ũ''_1. Thus, ũ_1 is an erasure unless both ũ'_1 and ũ''_1 are received correctly through their respective virtual channels. Therefore, we see that W̃_N^{(1)} is an erasure channel if W̃_{N/2}^{(1)} is an erasure channel, and we define ε_N^{(i)} to be the erasure rate of the channel W̃_N^{(i)}. Moreover, we find that ε_N^{(1)} = 1 − (1 − ε_{N/2}^{(1)})^2. If u_1 is known by the decoder (e.g., if ũ_1 is not an erasure), then the APP estimate of u_2 is given by ũ_2 = (u_1 ⊞ ũ'_1) ⊛ ũ''_1. Thus, W̃_N^{(2)} will be an erasure channel with erasure probability ε_N^{(2)} = (ε_{N/2}^{(1)})^2.

Now, assume that the bits u_1^{2i−2} are all known at the decoder. Figure 4 implies that the bits u'_1, u'_2, ..., u'_{i−1} and u''_1, u''_2, ..., u''_{i−1} will also be known at both G_{N/2} decoders. Moreover, the bits u_{2i−1}, u_{2i} are mapped through the G_2 polar transform to u'_i = u_{2i−1} ⊕ u_{2i} and u''_i = u_{2i}. Since the bits u'_1, ..., u'_{i−1} and u''_1, ..., u''_{i−1} are known at their respective G_{N/2} decoders, the bits u'_i and u''_i are effectively transmitted through two independent copies of the normalized virtual channel W̃_{N/2}^{(i)} (see Figure 4).

Similar to the previous argument, ũ_{2i−1} is an erasure unless both ũ'_i and ũ''_i are received correctly through their respective virtual channels. Therefore, we see that W̃_N^{(2i−1)} is an erasure channel if W̃_{N/2}^{(i)} is an erasure channel and ε_N^{(2i−1)} = 1 − (1 − ε_{N/2}^{(i)})^2. If u_{2i−1} is known by the decoder (e.g., if ũ_{2i−1} is not an erasure), then the APP estimate of u_{2i} is given by ũ_{2i} = (u_{2i−1} ⊞ ũ'_i) ⊛ ũ''_i. Thus, W̃_N^{(2i)} is an erasure channel with erasure probability ε_N^{(2i)} = (ε_{N/2}^{(i)})^2. Thus, we find that ε_1^{(1)} = ε and

ε_N^{(2i−1)} = 1 − (1 − ε_{N/2}^{(i)})^2
ε_N^{(2i)}   = (ε_{N/2}^{(i)})^2.

Algorithm 3 Density Evolution Analysis of Polar Code over BEC

function E = polar_bec(n,e)
% Compute effective-channel erasure rates for polar code of length N=2^n on BEC(e)
E = e;
for i=1:n
    % Interleave updates to keep in polar decoding order
    E = reshape([1-(1-E).*(1-E); E.*E],1,[]);
end
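
For example (an assumed usage, not in the notes), the call below evaluates the N = 8 effective channels of BEC(1/2); the polarization toward erasure rates near 0 and 1 is already visible.

% Sketch usage (assumed): effective erasure rates for N = 8 and e = 0.5.
E = polar_bec(3, 0.5)
% Expected output (in decoding order), approximately:
%   0.9961  0.8789  0.8086  0.3164  0.6836  0.1914  0.1211  0.0039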

3.4 General BMS Channel


Consider a BMS channel W with input alphabet X = {0, 1} and output alphabet Y. Now, we can define two new BMS channels W^− and W^+. Let W^− = W ⊞ W be the BMS channel with output alphabet Y × Y and transition probabilities

W^−((y_1, y_2)|u_1) = (W ⊞ W)((y_1, y_2)|u_1) ≜ ½ Σ_{u_2 ∈ {0,1}} W(y_1|u_1 ⊕ u_2) W(y_2|u_2).

This is the virtual channel associated with decoding U_1 from (Y_1, Y_2) when (Y_1, Y_2) are observations of (X_1, X_2) = (U_1 ⊕ U_2, U_2) through two independent copies of W. Likewise, let W′ be the BMS channel with output alphabet Y × Y × {0, 1} and transition probabilities

W′((y_1, y_2, u_1)|u_2) ≜ ½ W(y_1|u_1 ⊕ u_2) W(y_2|u_2).

This is the virtual channel associated with decoding U_2 from (Y_1, Y_2, U_1) when (Y_1, Y_2) are observations of (X_1, X_2) = (U_1 ⊕ U_2, U_2) through two independent copies of W.

Exercise 1. Show that the binary-input channels W^− and W′ are symmetric whenever W is symmetric.
Symmetry of the W channel also allows the output remapping (y_1, y_2, u_1) ↦ (y'_1, y_2) = (u_1 ⊕ y_1, y_2) to transform W′ into an equivalent channel with transition probabilities

Σ_{u_1 ∈ {0,1}} W′((u_1 ⊕ y_1, y_2, u_1)|u_2) = Σ_{u_1 ∈ {0,1}} ½ W(u_1 ⊕ y_1 | u_1 ⊕ u_2) W(y_2|u_2)
                                              = W(y_1|u_2) W(y_2|u_2).

The resulting channel is equivalent to having two independent observations of the same input. Therefore, we define W^+ = W ⊛ W to be the BMS channel with output alphabet Y × Y and transition probabilities

W^+((y_1, y_2)|u_2) = (W ⊛ W)((y_1, y_2)|u_2) ≜ W(y_1|u_2) W(y_2|u_2).
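
As a concrete example (added for illustration; it is not worked out in the notes), take W = BSC(p). Under W^−, the modulo-2 sum Y_1 ⊕ Y_2 is a sufficient statistic for U_1 and it differs from U_1 exactly when one of the two channels flips its input, so the normalized channel is equivalent to BSC(2p(1 − p)) and I(W^−) = 1 − h(2p(1 − p)), where h is the binary entropy function. Since I(W^+) + I(W^−) = 2I(W) (see the Introduction), it follows that I(W^+) = 1 − 2h(p) + h(2p(1 − p)). The channel W^+ itself is not a BSC: after the remapping above it amounts to two independent BSC(p) observations of U_2, so its normalized LLR output takes the value ±2 ln((1 − p)/p) when the two observations agree and 0 when they disagree. This BSC example is also the case where the bound I(W^−) ≤ 1 − h(2p(1 − p)) used in the proof of Lemma 2 (Section 3.6) holds with equality.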
While the output alphabet of the channels W^−(y|x) and W^+(y|x) nominally consists of all pairs y = (y_1, y_2) ∈ Y × Y, these channels can always be "normalized" to output a minimal sufficient statistic for the input. Similar to our previous discussion, we let W̃(z|x) be the transition probabilities of the channel X → Z defined by remapping the output of the X → Y channel (defined by W(y|x)) with

Z = ln[ W(Y|0) / W(Y|1) ]                     if LLR domain
Z = W(Y|1) / [ W(Y|0) + W(Y|1) ]              if P1 domain.


For general BMS channels, the recursive argument used above for erasure channels applies almost exactly. The only difference is that the resulting channels are not characterized by a single parameter and the channel combining operations are given by general "density evolution" operations rather than BEC arguments. Thus, we find that W̃_1^{(1)} = W and

W_N^{(2i−1)} = W_{N/2}^{(i)−} = W_{N/2}^{(i)} ⊞ W_{N/2}^{(i)}
W_N^{(2i)}   = W_{N/2}^{(i)+} = W_{N/2}^{(i)} ⊛ W_{N/2}^{(i)}.     (3)

In practice, it is much more computationally efficient to normalize these channels throughout the recursion. This leads to the observation that

W̃_N^{(2i−1)} = W̃_{N/2}^{(i)} ⊞̃ W̃_{N/2}^{(i)}
W̃_N^{(2i)}   = W̃_{N/2}^{(i)} ⊛̃ W̃_{N/2}^{(i)},

where ⊞̃ and ⊛̃ represent the ⊞ and ⊛ operations followed by channel normalization. The operators ⊛̃ and ⊞̃ are known as the bit-node and check-node "density evolution" operators and were originally defined for the analysis of LDPC codes.
Once you have a decoder, the bit-error rates of these effective channels are easily estimated via Monte Carlo simulation. To do this, one transmits a known vector and uses the decoder to estimate the input bits while passing genie-aided decisions for the output bits. This is why Algorithm 2 returns noisy hard decisions for frozen bits. For example, Algorithm 4 uses this method to characterize the effective channels of a polar code over the BSC.

Algorithm 4 Monte Carlo Estimate of Polar Code over the BSC

function [biterrd] = polar_bsc(n,p,M)
% Send M blocks for Monte Carlo estimate of length N=2^n polar code on BSC(p)

% Setup parameters
N = 2^n;
f = zeros(1,N);
biterrd = zeros(1,N);

% Monte Carlo evaluation of error probability
for i=1:M
    % Transmit all-zero codeword through BSC(p)
    y = zeros(1,N)+p;
    y(rand(1,N)<p)=1-p;
    % Decode received vector using all-zero frozen vector
    [uhat,xhat] = polar_decode(y,f);
    biterrd = biterrd + uhat;
end
biterrd = biterrd/M;
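
A typical call (assumed usage, not from the notes) looks like the following; sorting the estimated error rates reveals the most reliable input positions.

% Sketch usage (assumed): genie-aided error-rate estimates for N = 16 on BSC(0.11).
biterrd = polar_bsc(4, 0.11, 1000);
[~, order] = sort(biterrd);
order(1:8)                                 % the 8 most reliable input positions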

3.5 The Design of Polar Codes


We have now seen how one can use recursive channel-evolution to evaluate the quality of each of the
virtual channels. As described, this procedure uses “normalized” channels to prevent an increase in
the output dimension at each stage. In particular, the code-design process corresponds to using this
evolution procedure to evaluate the error probability of each virtual channel and then choosing the set
A to contain the indices of channels with sufficiently low error probability. For the other channels, one
simply transmits fixed dummy bits (e.g., zeros) that are known to the receiver in advance. Thus, these
can be recovered by the receiver without error.
The successive cancellation decoder is successful as long as each of the virtual channels in A is decoded correctly. Let (U_1, ..., U_N) be the message vector and (Û_1, ..., Û_N) be the vector of hard decisions produced by the successive cancellation decoder. Let

δ_N^{(i)} = P( Û_i ≠ U_i | Û_1^{i−1} = U_1^{i−1} )

be the hard-decision error probability of the virtual channel W_N^{(i)}. If we let the r.v.

I = min{ i ∈ {1, 2, ..., N} | Û_i ≠ U_i }
denote the index of the first incorrectly decoded bit, then we can upper bound the probability of block
error by

P_B = P(I ∈ {1, 2, ..., N})
    ≤ Σ_{i=1}^{N} P(I = i)
    ≤ Σ_{i∈A} P( Û_i ≠ U_i | Û_1^{i−1} = U_1^{i−1} )
    ≤ Σ_{i∈A} δ_N^{(i)},

because Û_i = U_i by definition for all i ∈ A^c. Likewise, the bit error probability is upper bounded by

P_b ≤ Σ_{i=1}^{N} |{j ∈ A | j ≥ i}| P(I = i)
    ≤ Σ_{i∈A} |{j ∈ A | j ≥ i}| P( Û_i ≠ U_i | Û_1^{i−1} = U_1^{i−1} )
    ≤ Σ_{i∈A} |{j ∈ A | j ≥ i}| δ_N^{(i)}.

If we want to minimize these upper bounds and send k = |A| information bits, then we can simply choose A to contain the k indices with the smallest values of δ_N^{(i)}.
Once the set of information bits has been chosen, it is easy to compute the generator and parity-check matrices of the code. For a matrix D and a subset B ⊆ N, let us define D_B (resp. D^B) to be the submatrix of D formed by selecting only the columns (resp. rows) of D whose indices are in B. If the frozen bits are all zero (i.e., u_i = 0 for all i ∉ A), then it follows that x = u_A G_N^A, where G_N^A is the |A| × N generator matrix for the polar code. Similarly, a parity-check matrix for the code can be derived from (2) because x G_N = u G_N G_N = u implies that

[x G_N]_{A^c} = x [G_N]_{A^c} = u_{A^c} = 0.

Thus, a natural choice for the parity-check matrix H satisfies H^T = [G_N]_{A^c}.
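
The construction of these matrices is easy to check numerically. The sketch below (added here; not part of the notes) builds G_N explicitly, selects an arbitrary example information set A, and confirms that the resulting generator and parity-check matrices satisfy G H^T = 0 over GF(2).

% Sketch (assumed check): generator and parity-check matrices from G_N.
n = 3; N = 2^n;
G2 = [1 0; 1 1];
G = 1; B = 1;
for m = 2.^(1:n)
    G = kron(G2, G);                       % G_2^{kron log2(m)}
    q = [1:2:m, 2:2:m];
    E = eye(m); R = E(:, q);
    B = R * kron(eye(2), B);               % B_m
end
GN = mod(B * G, 2);                        % G_N = B_N * G_2^{kron n}
A  = [4 6 7 8];                            % an example information set
Ac = setdiff(1:N, A);
Gen = GN(A, :);                            % rows of G_N indexed by A
H   = GN(:, Ac)';                          % H^T = columns of G_N indexed by A^c
any(any(mod(Gen * H', 2)))                 % prints 0: Gen*H' = 0 over GF(2)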
For the erasure channel BEC(ε), each effective channel is a BEC with erasure probability ε_N^{(i)} = 2 δ_N^{(i)} because randomly breaking ties results in an error rate that is half of the erasure rate. These effective channels are characterized by Algorithm 3. For the BSC, the effective channels are characterized by Algorithm 4 using a Monte Carlo approach. The test program listed as Algorithm 6 combines the design, encoding, and decoding procedures of a polar coding system.

Algorithm 5 Design a polar code based on effective-channel error rates

function f = polar_design(biterrd,d)
% Design a polar code based on effective-channel error rates

% Sort into increasing order and compute cumulative sum
[SE,order] = sort(biterrd);
CSE = cumsum(SE);

% Choose the k most reliable positions as information bits (the rest stay frozen)
k = sum(double(CSE<d));
f = zeros(1,length(biterrd));
f(order(1:k)) = 1/2;

3.6 Achieving Capacity
The most exciting thing about polar codes is that they allow the deterministic construction of a code
that achieves capacity on any BMS channel. To show this, Arıkan used an elegant martingale argument
based on the fact that each G2 -butterfly step preserves mutual information. Later, this approach was
simplified to avoid the explicit use of martingales [6]. We use the simplified approach below to show
that the fraction of unpolarized channels converges to 0 as n → ∞.
The following lemma is required for the argument below. A proof is provided at the end of the
section.

Lemma 2. For all δ ∈ (0, ½) and any symmetric W satisfying I(W) ∈ [δ, 1 − δ], the quantity

∆(W) ≜ ½ ( I(W^+) − I(W^−) )                                  (4)

satisfies ∆(W)^2 ≥ κ(δ), where

κ(δ) ≜ min_{h^{−1}(δ) ≤ p ≤ h^{−1}(1−δ)} ( h(2p(1 − p)) − h(p) )^2

and h(p) is the binary entropy function. Also, κ(δ) > 0 for δ ∈ (0, ½).


Let the average (over all channels) of the mutual information after n steps be denoted by

µ_{n+1} ≜ (1/2^{n+1}) Σ_{i=1}^{2^{n+1}} I( W_{2^{n+1}}^{(i)} )
   (a)  = (1/2^n) Σ_{i=1}^{2^n} [ ½ I( W_{2^{n+1}}^{(2i−1)} ) + ½ I( W_{2^{n+1}}^{(2i)} ) ]
   (b)  = (1/2^n) Σ_{i=1}^{2^n} [ ½ I( W_{2^n}^{(i)−} ) + ½ I( W_{2^n}^{(i)+} ) ]
   (c)  = (1/2^n) Σ_{i=1}^{2^n} I( W_{2^n}^{(i)} ) = µ_n,

where (a) breaks the sum into odd/even terms, (b) follows from (3), and (c) follows from

I(W) = ½ ( I(W^+) + I(W^−) ).

By induction, µ_n = µ_0 = I(W).

Combining with (4), we observe that

I(W)^2 + ∆(W)^2 = ½ ( I(W^+)^2 + I(W^−)^2 ).                  (5)
Let the average (over all channels) of the squared mutual information after n steps be denoted by

ν_{n+1} ≜ (1/2^{n+1}) Σ_{i=1}^{2^{n+1}} I( W_{2^{n+1}}^{(i)} )^2
   (a)  = (1/2^n) Σ_{i=1}^{2^n} [ ½ I( W_{2^{n+1}}^{(2i−1)} )^2 + ½ I( W_{2^{n+1}}^{(2i)} )^2 ]
   (b)  = (1/2^n) Σ_{i=1}^{2^n} [ ½ I( W_{2^n}^{(i)+} )^2 + ½ I( W_{2^n}^{(i)−} )^2 ]
   (c)  = (1/2^n) Σ_{i=1}^{2^n} [ I( W_{2^n}^{(i)} )^2 + ∆( W_{2^n}^{(i)} )^2 ]
        = ν_n + (1/2^n) Σ_{i=1}^{2^n} ∆( W_{2^n}^{(i)} )^2,   (6)

where (a) breaks the sum into odd/even terms, (b) follows from (3), and (c) follows from (5). From this,
we see that νn+1 ≥ νn . Since νn ∈ [0, 1] by definition, this implies that νn converges to a limit and,
hence, that νn+1 − νn → 0.
Now, we define the fraction of δ-unpolarized channels after n steps to be

θ_n(δ) ≜ (1/2^n) | { i ∈ [2^n] : I( W_{2^n}^{(i)} ) ∈ [δ, 1 − δ] } |.

Using Lemma 2, it follows that

(1/2^n) Σ_{i=1}^{2^n} ∆( W_{2^n}^{(i)} )^2 ≥ θ_n(δ) κ(δ).

Combining with (6), we see that

0 ≤ θ_n(δ) ≤ ( ν_{n+1} − ν_n ) / κ(δ).
Since ν_{n+1} − ν_n → 0, this implies that θ_n(δ) → 0 for all δ ∈ (0, 1/2). In addition, standard results from real analysis imply the existence of a sequence δ_n → 0 such that θ_n(δ_n) → 0. Thus, the fraction of unpolarized channels converges to 0 as n → ∞. Since µ_n = I(W) for all n, this also implies that a fraction I(W) of the virtual channels W_N^{(i)} become perfect (i.e., I(W_N^{(i)}) → 1) and a fraction 1 − I(W) of the channels become useless (i.e., I(W_N^{(i)}) → 0).
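
For the BEC, this polarization can be watched directly because Algorithm 3 computes I(W_N^{(i)}) = 1 − ε_N^{(i)} exactly. The script below (an added illustration, not from the notes) prints the fraction of δ-unpolarized channels for growing n.

% Sketch (assumed illustration): fraction of delta-unpolarized channels on BEC(0.5).
delta = 0.1;
for n = [4 8 12 16]
    E = polar_bec(n, 0.5);                 % effective erasure rates
    I = 1 - E;                             % mutual information of each channel
    theta = mean(I > delta & I < 1-delta); % theta_n(delta)
    fprintf('n = %2d: theta = %.4f\n', n, theta);
end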
Due to the sequential nature of successive decoding, Arıkan also uses a more sophisticated bound to
show that successive decoding achieves a probability of error that can be made arbitrarily small for any
rate less than capacity.
Proof of Lemma 2. If X_1, X_2 are binary and I(W) = I(X_1; Y_1) = I(X_2; Y_2) = 1 − h(p), then Mrs. Gerber's Lemma shows that

I(W^−) = 1 − H(X_1 ⊕ X_2 | Y_1, Y_2) ≤ 1 − h(2p(1 − p)).

Since I(W^+) = 2I(W) − I(W^−), it follows that

∆(W) = ½ ( I(W^+) − I(W^−) )
     = ½ ( 2I(W) − I(W^−) − I(W^−) )
     = I(W) − I(W^−)
     ≥ 1 − h(p) − (1 − h(2p(1 − p)))
     = h(2p(1 − p)) − h(p).

Since 0 < p < 2p(1 − p) ≤ ½ for p ∈ (0, ½) and h(p) is strictly increasing on the same set, it follows that ∆(W) > 0 as long as p ∈ (0, ½). For I(W) ∈ [δ, 1 − δ], one can minimize this bound over this range to see that ∆(W)^2 ≥ κ(δ). Finally, we note that κ(δ) > 0 for δ ∈ (0, ½) because κ(δ) > 0 as long as 0 < h^{−1}(δ) and h^{−1}(1 − δ) < ½.

A Information Theory Primer


Information theory is a very broad area and there are many excellent textbooks on the subject (e.g.,
[7]). This section is meant only to introduce basic concepts important for understanding polar codes.

A.1 Entropy and Mutual Information
Definition 3. The entropy (in bits) of a discrete random variable X with probability distribution p(x) is denoted

H(X) ≜ Σ_{x∈X} p(x) log_2 ( 1/p(x) ) = E[ log_2 ( 1/p(X) ) ],

where 0 log2 0 = 0 by continuity. The notation H(p) is used to denote H (X) when X ∼ p(x). When
there is no ambiguity, H will be used instead of H (X). The unit of entropy is determined by the base
of the logarithm with base-2 resulting in “bits” and the natural log (i.e., base-e) resulting in “nats”.
Remark 4. Roughly speaking, the entropy H (X) measures the uncertainty in the random variable X.
From a compression point of view, it equals the minimum average bit rate (in bits per symbol) to which
one can compress long sequences X1 , X2 , . . . drawn i.i.d. according to p(x). It also equals the exponential
growth rate, for i.i.d. sequences, of the smallest set that contains almost all the probability. For any
δ ∈ (0, 1), one can show that

H(X) = lim_{N→∞}  min_{ S ∈ { T ⊆ X^N : Σ_{(x_1,...,x_N)∈T} Π_{i=1}^{N} p(x_i) > 1−δ } }  (1/N) log_2 |S|.

Definition 5. The joint entropy (in bits) of a pair of r.v. (X, Y) ∼ p_{X,Y}(x, y) is denoted

H(X, Y) ≜ Σ_{(x,y)∈X×Y} p_{X,Y}(x, y) log_2 ( 1 / p_{X,Y}(x, y) ) = E[ log_2 ( 1 / p_{X,Y}(X, Y) ) ].

Notice that this is identical to H (Z) with Z = (X, Y ).


Definition 6. For a pair of r.v. (X, Y) ∼ p_{X,Y}(x, y), the conditional entropy (in bits) of Y given X is denoted

H(Y|X) ≜ Σ_{(x,y)∈X×Y} p_{X,Y}(x, y) log_2 ( 1 / p_{Y|X}(y|x) ) = E[ log_2 ( 1 / p_{Y|X}(Y|X) ) ].

Notice that this equals the entropy of the conditional distribution p_{Y|X}(y|x) averaged over x.
Definition 7. For a pair of r.v. (X, Y) ∼ p_{X,Y}(x, y), the mutual information (in bits) between X and Y is denoted

I(X; Y) ≜ Σ_{(x,y)∈X×Y} p_{X,Y}(x, y) log_2 [ p_{X,Y}(x, y) / ( p_X(x) p_Y(y) ) ] = E[ log_2 ( p_{X,Y}(X, Y) / ( p_X(X) p_Y(Y) ) ) ].

Lemma 8. Basic properties of joint entropy and mutual information:


1. (chain rule of entropy) H (X, Y ) = H (X) + H (Y |X). If X and Y are independent, H(X, Y ) =
H(X) + H(Y ).
Proof: Take the expectation of log_2 (1/p_{X,Y}(X, Y)) = log_2 (1/p_X(X)) + log_2 (1/p_{Y|X}(Y|X)) and note that p_{Y|X}(y|x) = p_Y(y) for all x, y if X and Y are independent.
2. (mutual information) The mutual information satisfies I(X; Y ) = I(Y ; X) and
I (X; Y ) = H (X) + H (Y ) − H (X, Y ) = H (X) − H (X|Y ) = H (Y ) − H (Y |X) .
Proof: Take the expectation of log_2 [ p_{X,Y}(X, Y) / ( p_X(X) p_Y(Y) ) ] = log_2 (1/p_X(X)) + log_2 (1/p_Y(Y)) − log_2 (1/p_{X,Y}(X, Y)) and apply the chain rule as needed. Also, symmetry follows from swapping X, Y and x, y in the sum because p_{X,Y}(x, y) = p_{Y,X}(y, x).
Example 9. Let X = Y = {0, 1} and p_{X,Y}(x, y) = ρ/2 if x ≠ y and p_{X,Y}(x, y) = (1 − ρ)/2 if x = y. Since p_X(x) = p_Y(y) = ½, it follows that H(X) = 1 and H(Y) = 1. Since p_{Y|X}(y|x) ∈ {ρ, 1 − ρ}, it follows that H(Y|X) = h(ρ). Thus, we have I(X; Y) = H(Y) − H(Y|X) = 1 − h(ρ). The conditional distribution p_{Y|X} is called the binary symmetric channel with error probability ρ and is denoted by BSC(ρ).

The previous definitions of entropy and mutual information all extend naturally to any finite number
of random variables by treating multiple random variables as a single random vector. However, there
are some new concepts that can only be defined in terms of 3 random variables. Let X, Y , and Z be
random variables with joint distribution pX,Y,Z (x, y, z).
Definition 10. For three r.v. (X, Y, Z) ∼ pX,Y,Z (x, y, z) defined on X ×Y ×Z, the conditional mutual
information (in bits) between X and Y given Z is denoted

I(X; Y|Z) ≜ Σ_{(x,y,z)∈X×Y×Z} p_{X,Y,Z}(x, y, z) log_2 [ p_{X,Y|Z}(x, y|z) / ( p_{X|Z}(x|z) p_{Y|Z}(y|z) ) ] = E[ log_2 ( p_{X,Y|Z}(X, Y|Z) / ( p_{X|Z}(X|Z) p_{Y|Z}(Y|Z) ) ) ].
From this, we see that I (X; Y |Z) = H (X|Z) + H (Y |Z) − H (X, Y |Z). Thus, the conditioning is simply
inherited by each entropy in the standard decomposition.

Lemma 11. The chain rule of mutual information states that I (X; Y, Z) = I (X; Y ) + I (X; Z|Y ) .
Proof. This follows from the expectation of the decomposition

log_2 [ p_{X,Y,Z}(X, Y, Z) / ( p_X(X) p_{Y,Z}(Y, Z) ) ] = log_2 [ p_{X,Y}(X, Y) p_{Z|X,Y}(Z|X, Y) / ( p_X(X) p_Y(Y) p_{Z|Y}(Z|Y) ) ]
                                                        = log_2 [ p_{X,Y}(X, Y) / ( p_X(X) p_Y(Y) ) ] + log_2 [ p_{X,Z|Y}(X, Z|Y) / ( p_{Z|Y}(Z|Y) p_{X|Y}(X|Y) ) ].

A.2 Channel Coding


In engineering, one often wants to communicate information across an unreliable medium. For example,
think of a system that modulates the current in a wire (by adjusting the voltage at one end) and measures
the current at the other end. Due to thermal fluctuations, the difference between the modulated current
at the measured current will always contain some randomness. One can analyze this situation by first
discretizing time and then defining a simple mathematical model.
Definition 12. A discrete memoryless channel (DMC) is defined by a finite input alphabet X , a
finite output alphabet Y, and a conditional probability distribution W (y|x). For N ∈ N channel uses,
let the channel input vector be a random vector X = (X1 , . . . XN ) ∈ X N . Then, the channel output
vector is a random vector Y = (Y1 , . . . , YN ) ∈ Y N where
W_N(y|x) ≜ P( Y = y | X = x ) = Π_{i=1}^{N} W(y_i | x_i).

Definition 13. A DMC with binary inputs is called symmetric if there is a permutation π : Y → Y
satisfying W (π(y)|1) = W (y|0) and π(π(y)) = y for all y ∈ Y.
Example 14. For example, the binary symmetric channel (BSC) with error probability ρ has
X = Y = {0, 1} and is defined by

W (y|x) = (1 − ρ)I(x = y) + ρI(x 6= y).

Example 15. For example, the binary erasure channel (BEC) with erasure probability  has X =
{0, 1}, Y = {0, 1, ∗}, and is defined by

W(y|x) = ε I(y = ∗) + (1 − ε) I(y = x).

Channel coding is the process of improving performance by adding redundancy (e.g., by encoding a K-bit message into N > K bits).

Definition 16. For a code/decoder pair, the block error probability PB is the probability that the
decoder does not map the received sequence to the transmitted codeword. A code rate R is achievable
if there exists a sequence of encoder/decoder pairs with rate RN → R and block error rate PB,N → 0.
The channel capacity is the supremum of all achievable code rates.
Remark 17. One can get a qualitative feel for achievable rates via the following argument. The key is that, for i.i.d. sequences (X_1, ..., X_N) with large N, the probability distribution essentially becomes uniform over a set of 2^{NH(X)} "typical" sequences. Thus, for (X, Y) ∼ p(x)W(y|x), the i.i.d. sequence ((X_1, Y_1), ..., (X_N, Y_N)) takes one of 2^{NH(X,Y)} different typical values with essentially uniform probability. If we ignore the X values, then the number of (Y_1, ..., Y_N) typical sequences is roughly 2^{NH(Y)}. If we fix the (X_1, ..., X_N) sequence to a typical value (x_1, ..., x_N), then the number of ((x_1, Y_1), ..., (x_N, Y_N)) typical sequences is roughly 2^{NH(Y|X)}. This last set of sequences can be seen as the likely set of output sequences if x is transmitted. Thus, if the likely output sets of each codeword fill the space but do not overlap, then we get 2^{NR} · 2^{NH(Y|X)} = 2^{NH(Y)}, or R = H(Y) − H(Y|X) = I(X; Y).
Theorem 18 (Channel Coding Theorem). For a DMC, the channel capacity is given by

C ≜ max_{p(x)} I(X; Y) = max_{p(x)} Σ_{(x,y)∈X×Y} p(x) W(y|x) log_2 [ W(y|x) / Σ_{x'} p(x') W(y|x') ].

Thus, for any R ≤ C, there exists a sequence of encoder/decoder pairs such that RN → R and PB,N → 0.
Conversely, if a sequence of encoder/decoder pairs satisfies RN → R and PB,N → 0, then R ≤ C.
Remark 19. For a symmetric DMC, the capacity C is achieved by the uniform input distribution p(x) =
1/ |X |.
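
As a quick numerical check (added here; not part of the notes), the snippet below evaluates the capacity formula of Theorem 18 with a uniform input for the BSC(p) and compares it with the closed form 1 − h(p) from Example 9.

% Sketch (assumed check): capacity of BSC(p) from the formula vs 1 - h(p).
p = 0.11;
W = [1-p p; p 1-p];                        % W(y|x): rows are x, columns are y
px = [0.5 0.5];                            % uniform input (Remark 19)
py = px * W;                               % output distribution
C = 0;
for x = 1:2
    for y = 1:2
        C = C + px(x)*W(x,y)*log2(W(x,y)/py(y));
    end
end
h = @(q) -q.*log2(q) - (1-q).*log2(1-q);
[C, 1 - h(p)]                              % both approximately 0.50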

References
[1] E. Arıkan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric
binary-input memoryless channels,” IEEE Trans. Inform. Theory, vol. 55, pp. 3051–3073, July 2009.
[2] E. Arikan, “An upper bound on the cutoff rate of sequential decoding,” IEEE Trans. Inform. Theory,
vol. 34, pp. 55–63, Jan. 1988.
[3] S. H. Hassani, K. Alishahi, and R. Urbanke, “Finite-length scaling for polar codes,” IEEE Trans.
Inform. Theory, vol. 60, no. 10, pp. 5875–5898, 2014.

[4] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Inform. Theory, vol. 61, no. 5,
pp. 2213–2226, 2015.
[5] G. H. Golub and C. F. Van Loan, Matrix computations. Johns Hopkins University Press, 3rd ed.,
2012.

[6] M. Alsan and E. Telatar, “A simple proof of polarization and polarization for non-stationary memo-
ryless channels,” IEEE Trans. Inform. Theory, vol. 62, no. 9, pp. 4873–4878, 2016.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley Series in Telecommunications,
Wiley, 2nd. ed., 2006.

Algorithm 6 Main Test Program for Polar Coding System

% Setup code parameters
n = 12; N = 2^n;
e = 0.5; p = 0.11;
d = 0.1; bec = 1;

% Compute the quality of all effective channels
if (bec)
    [biterrd] = polar_bec(n,e);
else
    [biterrd] = polar_bsc(n,p,1000);
end

% Design polar code
f = polar_design(biterrd,d);
A = (f==1/2);
k = sum(A);
rate = k/N

% Run a few sims to compare with union bound
M = 10;
biterr = zeros(1,M);
for i=1:M
    % Set frozen bits, add random data, and encode
    u = f;
    u(A) = rand(1,k)<0.5;
    x = polar_transform(u);

    % Transmit
    if (bec)
        y = x;
        y(rand(1,N)<e)=1/2;
    else
        y = zeros(1,N)+p;
        y(x==1) = 1-y(x==1);
        err = rand(1,N)<p;
        y(err) = 1-y(err);
    end

    % Decode and compute error rate for info bits
    [uhat,xhat] = polar_decode(y,f);
    biterr(i) = mean(uhat(A)~=u(A));
end

% Display average bit and block error rate
mean(biterr)
mean(biterr>0)
