
16th Euromicro Conference on Parallel, Distributed and Network-Based Processing

Edge Stream Oriented LDPC Decoding

Gabriel Falcão, Vitor Silva, Marco Gomes
Instituto de Telecomunicações/FCTUC
Dep. of Electrical and Computer Engineering
University of Coimbra, Portugal
{gff, vitor, marco}@co.it.pt

Leonel Sousa
INESC-ID/IST
Dep. of Electrical and Computer Engineering
Technical University of Lisbon, Portugal
[email protected]

Abstract

Low-Density Parity-Check (LDPC) codes are among the best error correcting codes known and have been adopted by data transmission standards, such as DVB-S2 or WiMax. They are based on binary sparse parity check matrices and are usually represented by Tanner graphs. LDPC decoders require very intensive message-passing algorithms, also known as belief propagation. This paper proposes a very compact stream-based data structure to represent such a bipartite Tanner graph, which supports both regular and irregular codes. This compact data structure not only reduces the memory required to represent the graph but also puts it in an appropriate format to gather data into streams. This representation also allows the irregular processing behavior of the Sum Product Algorithm (SPA) used in LDPC decoding to be mapped into the stream-based computing model. Stream programs were developed for LDPC decoding and the results show significant speedups obtained either using general purpose processors or graphics processing units. The simultaneous decoding of several codewords was performed using the SIMD capabilities of modern stream-based architectures available on recent processing units.

1. Introduction

Recent processing units have achieved increasing performance by exploiting the potential of stream-based architectures, not only in general purpose computers but also in more specific computing engines [6]. The stream computing model gathers data into a stream, processes it and then scatters the output data stream back to memory. Programming according to this type of model minimizes memory accesses while exploiting parallelism and locality of data. It allows applications to be sped up, namely by increasing parallelism through Single Instruction Multiple Data (SIMD) processing.

Low-Density Parity-Check (LDPC) codes are powerful error correcting codes (ECC) proposed by Robert Gallager in 1962 [4]. Their rediscovery by Mackay and Neal in 1996 [9] led to a growing interest in the research and development of efficient LDPC coding solutions for digital communication systems. They have been adopted by the DVB-S2 standard [2], WiMax Mobile (802.16e), WiFi (802.11n), 10Gbit Ethernet (802.3an) and other new emerging standards oriented to space, optical and storage applications. LDPCs are linear (n, k) block codes defined by sparse binary parity check H matrices of dimension (n − k) × n. Usually, they are represented by bipartite graphs formed by Bit Nodes (BNs) and Check Nodes (CNs), linked by bidirectional edges, also called Tanner graphs [13]. The Sum Product Algorithm (SPA) is a very efficient algorithm used for LDPC decoding [8]. It is based on the belief propagation between connected nodes as indicated by the Tanner graph edges (see figure 1).

LDPC decoders demand very intensive computation and, therefore, a lot of research has been performed in the last few years to implement dedicated VLSI decoders [15, 3] based on fixed-point arithmetic. These decoders, in particular the ones based on the SPA, belong to the class of applications for which stream-based computing architectures can be adopted and applied, but new data representations have to be developed. Additionally, the implementation on these architectures allows floating-point arithmetic operations, so better accuracy can be expected as well as lower Bit Error Rates (BER).

This paper proposes an SPA algorithm oriented to LDPC stream-based decoding and new data structures to efficiently represent the H matrix. The compact data representation and the development of a new stream-based approach to calculate the SPA under a specific type of recursive computing model are the main contributions of this paper. To the best of the authors' knowledge, this is also the first time LDPC decoders are investigated under a stream-based computing perspective.

This paper is organized as follows. The next section addresses the aspects of the SPA algorithm that are important to understand the proposed stream-based LDPC decoders. The following section mentions properties of the stream computing model, while section 4 presents the data structures and the SPA flooding schedule specifically developed for stream-based computing. Section 5 reports experimental results obtained for SPA stream LDPC decoding on stream-based processors, such as general purpose processors with multimedia extensions and graphics processing units (GPUs). Finally, section 6 concludes the paper.

2 SPA for LDPC Decoding

The Sum Product Algorithm (SPA), as proposed by Gallager, operates in the probabilistic domain [4][9][8].

Given an (n, k) LDPC code, we assume BPSK modulation, which maps a codeword c = (c_1, c_2, ..., c_n) into the sequence x = (x_1, x_2, ..., x_n), according to x_i = (−1)^{c_i}. Then, x is transmitted through an additive white Gaussian noise (AWGN) channel, originating the received sequence y = (y_1, y_2, ..., y_n), with y_i = x_i + n_i, and n_i being a random variable with zero mean and variance σ² = N_0/2.

        [ 1 1 1 0 0 0 0 0 ]
    H = [ 0 0 0 1 1 1 0 0 ]    (8 bit nodes checked by 4 check node equations)
        [ 1 0 0 1 0 0 1 0 ]
        [ 0 1 0 0 1 0 0 1 ]

Figure 1. Tanner graph representation of a 4 × 8 H matrix and examples of some messages being exchanged between CN_m and BN_n; a similar representation applies for CN message updating. (The original figure shows the edges between CN_0–CN_3 and BN_0–BN_7 and, for each check node, one example update: check node 0 updating bit node 0, check node 1 updating bit node 4, check node 2 updating bit node 6 and check node 3 updating bit node 7.)

Algorithm 1 SPA

1: {Initialization}
   q_{nm}^{(0)}(0) = 1 − p_n;  q_{nm}^{(0)}(1) = p_n;
   p_n = 1 / (1 + e^{2 y_n / σ²});  E_b/N_0 = 1 / (2 K σ²)

2: while (ĉ H^T ≠ 0 ∧ i < I) do  {ĉ - decoded word; I - maximum number of iterations}

3:   {For each node pair (BN_n, CN_m), corresponding to H_{mn} = 1 in the parity check matrix H of the code, do:}

4:   {Compute the message sent from CN_m to BN_n, indicating the probability of BN_n being 0 or 1:}
     (Kernel 1 - Horizontal Processing)

       r_{mn}^{(i)}(0) = 1/2 + 1/2 ∏_{n' ∈ N(m)\n} (1 − 2 q_{n'm}^{(i−1)}(1))    (1)

       r_{mn}^{(i)}(1) = 1 − r_{mn}^{(i)}(0)    (2)

     {where N(m)\n represents the BNs connected to CN_m excluding BN_n.}

5:   {Compute the message sent from BN_n to CN_m:}
     (Kernel 2 - Vertical Processing)

       q_{nm}^{(i)}(0) = k_{nm} (1 − p_n) ∏_{m' ∈ M(n)\m} r_{m'n}^{(i)}(0)    (3)

       q_{nm}^{(i)}(1) = k_{nm} p_n ∏_{m' ∈ M(n)\m} r_{m'n}^{(i)}(1)    (4)

     {where the k_{nm} are chosen to ensure q_{nm}^{(i)}(0) + q_{nm}^{(i)}(1) = 1, and M(n)\m is the set of CNs connected to BN_n excluding CN_m.}

6:   {Compute the a posteriori pseudo-probabilities:}

       Q_n^{(i)}(0) = k_n (1 − p_n) ∏_{m ∈ M(n)} r_{mn}^{(i)}(0)
       Q_n^{(i)}(1) = k_n p_n ∏_{m ∈ M(n)} r_{mn}^{(i)}(1)

     {where the k_n are chosen to guarantee Q_n^{(i)}(0) + Q_n^{(i)}(1) = 1.}

7:   {Perform hard decoding} ∀n,
       ĉ_n^{(i)} = 1 if Q_n^{(i)}(1) > 0.5, and 0 otherwise

8: end while
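To make the initialization step concrete, the following C fragment is a minimal sketch of step 1 of Algorithm 1 for a block of received samples; the function and array names are illustrative, not from the paper:

    #include <math.h>

    /* Step 1 of Algorithm 1: channel priors for BPSK over AWGN,
       pn = 1 / (1 + exp(2*yn/sigma^2)). Names are illustrative. */
    void init_priors(const double *y, double *p, int n, double sigma2)
    {
        for (int i = 0; i < n; i++)
            p[i] = 1.0 / (1.0 + exp(2.0 * y[i] / sigma2));
    }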

SPA is depicted in Algorithm 1 [8] and it is mainly described by two intensive processing blocks, horizontal and vertical, defined, respectively, by equations (1), (2) and (3), (4). The former two equations calculate the message update from each CN_m to BN_n, considering accesses to H on a row basis; they indicate the probability of BN_n being 0 or 1. Figure 1 exemplifies, for a particular 4 × 8 H matrix, BN_0 being updated by CN_0 and BN_4 updated by CN_1. It also shows BN_6 and BN_7 being updated, respectively, by CN_2 and CN_3. For each iteration, the r_{mn}^{(i)} values are updated by equations (1) and (2) according to the edges defined in the Tanner graph. For the example in figure 1, r_{00} = f(q_{10}, q_{20}) and r_{26} = f(q_{02}, q_{32}).

Similarly, the second major processing block, described by equations (3) and (4), computes the messages sent from BN_n to CN_m, assuming accesses to H on a column basis. In this case, the q_{nm}^{(i)} values hold the update defined by equations (3) and (4) for the edge connectivity indicated by the Tanner graph. Again, in figure 1 it can be seen that q_{00} = f(r_{20}) and q_{41} = f(r_{34}). Thus, equations (1) to (4) define very irregular memory access patterns.

The iterative procedure is stopped if the decoded word ĉ verifies all parity check equations, that is ĉ H^T = 0, or when a maximum number of iterations is reached.
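As an illustration of the two processing blocks, the following C sketch implements equations (1)-(4) for a single CN and a single BN over flat per-edge arrays; only the "(1)" probabilities are stored, since the complementary values follow from normalization. The edge-index arrays and all names are assumptions made for the example, not the structures proposed in section 4:

    /* Kernel 1 (horizontal): update rmn(1) for the 'deg' edges of one CN,
       given the qnm(1) of the previous iteration, eqs. (1)-(2). */
    void cn_update(const int *edge, int deg, const double *q1, double *r1)
    {
        for (int j = 0; j < deg; j++) {
            double prod = 1.0;
            for (int k = 0; k < deg; k++)
                if (k != j)                        /* exclude BN n: N(m)\n */
                    prod *= 1.0 - 2.0 * q1[edge[k]];
            r1[edge[j]] = 1.0 - (0.5 + 0.5 * prod);  /* eqs. (1)-(2) */
        }
    }

    /* Kernel 2 (vertical): update qnm(1) for the edges of one BN,
       given its channel prior pn, eqs. (3)-(4). */
    void bn_update(const int *edge, int deg, double pn,
                   const double *r1, double *q1)
    {
        for (int j = 0; j < deg; j++) {
            double q0 = 1.0 - pn;                  /* eq. (3), unnormalized */
            double q1j = pn;                       /* eq. (4), unnormalized */
            for (int k = 0; k < deg; k++) {
                if (k == j)
                    continue;                      /* exclude CN m: M(n)\m */
                q0  *= 1.0 - r1[edge[k]];          /* r_{m'n}(0) */
                q1j *= r1[edge[k]];                /* r_{m'n}(1) */
            }
            q1[edge[j]] = q1j / (q0 + q1j);        /* knm normalization */
        }
    }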

3 Stream-based Computing

The stream programming style decouples computation and data locality, which encourages compilers and programmers to explore more efficiently the parallelism intrinsic to a program. It is based on a gather, compute and scatter model [5].

A stream program gathers a large amount of data from memory into input streams and applies them to one or more kernels. A kernel consists of a group of operations performed on one or more input streams, with a desired functionality implicit. A kernel can compute hundreds of operations, and data parallelism may be exploited within a kernel, for example, by using SIMD processing. After computation, data feeds the input of other kernels or is scattered to memory again. By reducing and concentrating memory accesses, this programming model converts a latency problem into a bandwidth restriction.

A Synchronous Data Flow (SDF) graph [7], such as the one depicted in figure 2, can be used to explicitly express data dependencies and the amount of data consumed and produced at each input and output data flow node. Operations are performed by kernels that can have multiple input streams and may perform complex calculations per input data item. The input and output streams are the only external data a kernel can access.

Some applications in the signal, image and graphics processing fields can gather and scatter data using sequential load/store operations, while others exhibit random, or at least irregular, accesses to memory, which makes streaming them more challenging. For these types of memory accesses, the addresses first need to be collected into indexed arrays before gathering/scattering the data. Moreover, some of them are iterative and require dynamic adjustment to be programmed in a stream-based style. This happens to be the case of the LDPC decoder based on the SPA algorithm described in the next sections; a minimal sketch of the gather, compute and scatter pattern is given below.
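The following C fragment illustrates the gather, compute and scatter model in its simplest form; the index array and the kernel body are illustrative only:

    /* Gather scattered memory into a dense stream, run a kernel on it,
       and scatter the results back. All names are illustrative. */
    void stream_pass(const float *mem, const int *idx,
                     float *out, float *stream, int n)
    {
        for (int i = 0; i < n; i++)       /* gather */
            stream[i] = mem[idx[i]];
        for (int i = 0; i < n; i++)       /* kernel: one op per element */
            stream[i] = 1.0f - 2.0f * stream[i];
        for (int i = 0; i < n; i++)       /* scatter */
            out[idx[i]] = stream[i];
    }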
4 Stream-based LDPC Decoding

The H matrix defines the Tanner graph and, thus, the flow of messages being exchanged between CNs and BNs. In medium to large sized LDPC codes a huge amount of memory is required to store this matrix. Moreover, in a stream computing model this matrix has to be itself gathered into a data stream. Therefore, we propose to separately code the information about H in two independent data streams adequate to the SPA. These particular auxiliary data streams establish the order of the data in the input stream used in the flow of messages to update the BNs and the CNs. A stream-based SPA for implementing the LDPC decoder is represented in the SDF of figure 2.

Figure 2. Synchronous Data Flow graph for a stream-based LDPC decoder: the pair Kernel 1 and Kernel 2 is repeated n times for an LDPC decoder with n iterations. (The original figure shows the sequence HBN -> Kernel 1 -> HCN -> Kernel 2, with the streams ps0, rs0, qs0, ..., rsi, qsi and the constants k1 and k2.)

According to Algorithm 1, the first Kernel 1 receives at its input the data stream ps0 (p stream 0), a constant k1 and the stream HBN, which holds the necessary information to perform the SPA horizontal processing indicated in equations (1) and (2). It produces the output stream rs0, which is one of the input data streams of Kernel 2. The other inputs of this second kernel are HCN, which holds the information to compute the SPA vertical processing associated with equations (3) and (4), and the constant k2. The last Kernel 2 produces the output stream qsi. The constants k1 and k2 represent the dimensions of the matrix.

4.1 Mapping the Tanner Graph into an Efficient Structure for Streaming

Algorithm 1 shows that the stream-based LDPC decoder needs different BN and CN update computation patterns for consecutive kernels. The input streams represent the original probabilities pn and the rmn|qnm values, and the output stream corresponds to the rmn|qnm updated messages. The output data stream feeds the input stream of the next processing kernel in a pipeline organization.
Figure 3. HBN structure. A row based data stream representing BN edges with circular addressing in 2D shape. (The original figure shows the twelve edges of the example matrix, r_{0,0} r_{0,1} r_{0,2} | r_{1,3} r_{1,4} r_{1,5} | r_{2,0} r_{2,3} r_{2,6} | r_{3,1} r_{3,4} r_{3,7}, each entry holding the 2D address of the next edge in the same row of H.)

Figure 4. HCN structure. A column based data stream representing CN edges with circular addressing in 2D shape. (The original figure shows the same twelve edges, q_{0,0} q_{0,1} q_{0,2} | q_{1,3} q_{1,4} q_{1,5} | q_{2,0} q_{2,3} q_{2,6} | q_{3,1} q_{3,4} q_{3,7}, each entry holding the 2D address of the next edge in the same column of H; columns with a single non-null element, such as columns 2, 5, 6 and 7, are marked with -1.)

The codeword size (dimension) and the rate k/n (indicating the ratio between parity and message information) imply matrices with quite different sizes. If the quantity of "1's" is constant in all rows and if the same rule applies for columns, the code is said to be regular; otherwise, it is irregular. The latter type of code is known to have better coding performance than the former but is more difficult to implement [8]. The stream data representation and algorithm proposed in this paper support both regular and irregular codes. Compact and efficient dynamic and vectorized data structures, HBN and HCN, are proposed to represent the message updating between connected nodes (BNn, CNm). Such structures require significantly less memory, since H is sparse, and are suitable for stream-based computing.

The example of figure 1 can be used to describe the transformation performed on H in order to produce the compact data stream structures HBN and HCN. HBN codes the information about the edge connections used in each parity check equation (horizontal processing). This data structure is generated by scanning the H matrix in row major order and by sequentially mapping only the BN edges associated with a single CN equation (the non-null elements in a single row of H). Each element of the data structure records the address of the next entry, as a pointer, and the corresponding rmn value. The element corresponding to the last non-null element of each row in H points to the first element of that row, implementing a circular list that is used to update all the BNs connected to a single CN. Thus, a CN always knows from which BNs it should be reading information, and also where to update. Circular addressing allows a high level of parallelism to be explored. In the limit, for a multi-processor platform, a different processor can be allocated to each single edge, since the starting point is no longer important, due to the circular addressing feature. Although in figure 3 the elements of the structures are represented by their row and column addresses, the structures can be easily vectorized by considering the rows of the structure in sequence, using convenient 1D or 2D reshaping according to the target stream-based architecture they apply to.

The same principle can be applied to code the information about CNs. HCN can be defined as a sequential representation of every BN connecting to all its neighboring CNs (the non-null elements in a single column of H). Once again, the access between adjacent elements is circular, as illustrated in figure 4 for the H matrix given in figure 1. In this case, a careful construction of the addresses in HCN is required, because every (n,m) edge must be represented exactly in the same position as it is in HBN. This meticulous positioning of the elements in HCN allows the iterative processing of consecutive inputs to be alternately performed for r_{mn}^{(i)} and q_{nm}^{(i)}, guaranteeing that the data being processed for every (n,m) edge is stored in the correct position for both situations. In fact, the in-place computation performed here for horizontal and vertical processing is supported by an efficient mechanism that swaps data pointers [14] at the end of each iteration, according to the input data streams necessary for the next processing step.

Figures 3 and 4 show the resulting HBN and HCN structures corresponding to the matrix in figure 1.
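A minimal C sketch of one HBN element and of the circular traversal described above could look as follows; the field names and the index-based (rather than pointer-based) linking are our assumptions:

    /* One HBN entry: the edge's current rmn message plus the index of the
       next edge in the same row of H; the last edge of a row points back
       to the first, forming a circular list. */
    typedef struct {
        float r1;    /* rmn(1); rmn(0) = 1 - rmn(1) */
        int   next;  /* index of the next edge of the same CN equation */
    } hbn_entry;

    /* Visiting all edges of a CN starting from any of them: */
    void visit_row(const hbn_entry *hbn, int start)
    {
        int e = start;
        do {
            /* process edge e ... */
            e = hbn[e].next;
        } while (e != start);
    }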

Figure 5. Simultaneous decoding concept of multiple codewords using the same H matrix. (The original figure shows a detail of HBN being shared by the data streams of word 1, word 2, ..., word n.)

Figure 6. Folded Synchronous Data Flow graph for a stream-based LDPC decoder. Kernel 1 receives the input streams psi and HBN and the constant k1, and produces rsi. Kernel 2 takes the output rsi of Kernel 1, HCN and the constant k2 as inputs and produces the output qsi, which is fed back to the input for the next iteration.
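In a sequential C model, the folded SDF of figure 6 amounts to reusing one kernel pair per iteration and exchanging only stream pointers between passes; the following sketch assumes hypothetical kernel signatures:

    /* Hypothetical signatures of the two SPA passes over all edges. */
    void kernel1(const float *q, float *r, int num_edges);  /* eqs. (1)-(2) */
    void kernel2(const float *r, float *q, int num_edges);  /* eqs. (3)-(4) */

    /* Folded iteration: qs is initialized from ps0 and is then fed back as
       the input of Kernel 1 for every iteration after the first, as
       selected by the multiplexer in figure 6. */
    void decode(float *qs, float *rs, int num_edges, int iterations)
    {
        for (int i = 0; i < iterations; i++) {
            kernel1(qs, rs, num_edges);   /* rs_i: CN -> BN messages */
            kernel2(rs, qs, num_edges);   /* qs_i: BN -> CN messages */
        }
        /* qs now conveys the messages used for the hard decision */
    }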
4.2 SPA Kernels Optimization and Parallelism

The flooding schedule algorithm [8] used controls the order in which BNs and CNs are updated. It guarantees that no BN is updated before all the other BNs conclude their updating procedure. This principle is adopted in this work because it allows true parallel execution of the SPA decoder.

The circular addressing property permits parallel processors to simultaneously access the necessary information stored in both HBN and HCN. Consequently, the overall processing time required to decode a codeword can be significantly reduced by applying parallel processing. Figure 5 represents a detail of HBN illustrating such efficient properties for the update of the r_{mn}^{(i)} values.

An alternative representation of the SDF in figure 2 is presented in figure 6. This compact SDF is obtained by folding the pair of kernels (Kernel 1 and Kernel 2) and by introducing an additional multiplexer. Kernel 1 initially processes the input stream ps0 according to the BN indexes defined in the input stream HBN. It produces the output stream rs0 by performing the message update from CNm to BNn. The stream rs0 directly feeds the input of Kernel 2, which in turn produces qs0, representing the messages updated from BNn to CNm. The other input of Kernel 2 is HCN, which defines the CN indexes from H. Only data streams communicate between kernels. Figure 6 shows that, by using the multiplexer at the input of the SDF, the output stream qsi becomes the input stream of Kernel 1 for all remaining iterations after the first one. At the end, the qsi output stream conveys the codeword.

This compact representation can be exploited to implement efficient LDPC decoders, not only because of the new compact stream oriented representations of the H matrix but also regarding the program memory and the data communication. The required program memory is reduced by the folding mechanism, and the data streams can be multiplexed through a simple and efficient mechanism for swapping pointers.

Additionally, the SIMD processing capabilities of stream-based architectures can be applied to Kernel 1 and Kernel 2 by simultaneously decoding several codewords (see figure 5). Multimedia extensions on Central Processing Units (CPUs) and the multiple pixel processors on Graphics Processing Units (GPUs) are examples of the SIMD computation machines explored in this paper.

4.3 Memory Requirements and Computational Complexity

If a good error correcting capability is required (e.g., with a Bit Error Rate in the order of 10^{-8} or lower), large matrices should be considered [8]. Although they are sparse, storing such information may require a considerable amount of memory.

With the proposed compact data structure, the necessary memory can be significantly reduced. In fact, let us consider a standard matrix used in DVB-S2 [2] with n = 64800 coded bits (columns) and (n − k) = 16400 parity check equations (rows), a rate of approximately 3/4 and a total number of non-null elements in H (numEdges) equal to 194400. The number of elements stored can be reduced from 106.272 × 10^7 to only 194400, which leads to a Storage Compression Ratio (SCR), given by (5), of 2733, i.e. less than 0.04% of the original necessary memory space.

    SCR = (n × (n − k)) / (numEdges × 2)    (5)

At the same time, the overall processing time needed to access all matrix elements is reduced, improving the computational efficiency of the decoder. Considering only load/store operations, the computational complexity of the SPA algorithm decreases from O(n × (n − k)), for the full H storage, to O(numEdges), i.e. from quadratic to linear complexity in the number of edges.
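As a quick numerical check of equation (5) for the DVB-S2 example above, the following self-contained C program reproduces the quoted figures:

    #include <stdio.h>

    int main(void)
    {
        long long n = 64800, k = 48400;   /* (n - k) = 16400 rows */
        long long numEdges = 194400;
        long long full = n * (n - k);     /* 1,062,720,000 = 106.272e7 */
        double scr = (double)full / (double)(numEdges * 2);
        printf("full H: %lld elements, SCR = %.1f\n", full, scr); /* 2733.3 */
        return 0;
    }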

5 Experimental Results

For performance evaluation, we developed stream-based LDPC decoders, both for general purpose processors (x86 CPUs) with multimedia extensions and for graphics processing units (GPUs). Streaming SIMD Extensions (SSE) [12] are considered to be the evolution of the first MMX technology extensions to the Intel architecture [11]. The experimental results are documented in the following sections.

5.1 Stream-based SPA on the CPU

The proposed algorithm is based on efficient linked lists. It addresses only the non-null elements in the H matrix, and the structure used to represent the edges is similar to the one described in section 4.

For a given H data stream representing a Tanner graph, it is possible to decode N codewords in parallel, which seems to encourage the use of packed instructions. In the present case, it is possible to accommodate such an optimization by performing the same arithmetic operation to decode four codewords simultaneously. The optimization was performed by applying SIMD processing based on the use of arithmetic packed instructions, where an elementary arithmetic operation is applied to four floating-point data elements in a single instruction. In fact, instructions like subps or mulps available in the SSE allow the simultaneous subtraction or multiplication, respectively, of four single-precision floating-point values.

Figure 7 illustrates the use of the SSE instructions to update the BNs according to equations (1) and (2) in Algorithm 1. Using the xmm0-xmm7 128-bit XMM registers, four floating-point elements are packed and operated together in a single instruction. Figure 8 compares the performance of an LDPC decoder serially decoding four codewords (one by one) against LDPC decoding of the same four codewords in parallel (simultaneously) using SSE instructions.

        test eax, eax
        mov edx, DWORD PTR [esi+4]
        movaps xmm0, xmm6
        jle SHORT .loop3
        npad 6
    loop1:                              ; if (p4 != q4)
        cmp edx, ecx
        je SHORT .loop2
                                        ; aux_value0 *= (p->value0 - p->value1);
                                        ; v0 = _mm_load_ps(p4->value0);
                                        ; v1 = _mm_load_ps(p4->value1);
        movaps xmm1, XMMWORD PTR [edx+16]
        movaps xmm2, XMMWORD PTR [edx]
                                        ; rv = _mm_sub_ps(v0, v1);
        subps xmm2, xmm1
                                        ; auxv0 = _mm_mul_ps(rv, auxv0);
        movaps xmm1, xmm2
        mulps xmm1, xmm0
        movaps xmm0, xmm1
    loop2:                              ; p4++;
        add edx, 128
        sub eax, 1
        jne SHORT .loop1
    loop3:

Figure 7. Loop describing a BN update operation on an x86 CPU with packed arithmetic instructions.

Figure 8. Decoding processing time for a Core2Duo Intel CPU. (The original plot shows the decoding time in ms versus the number of edges, from 2997 to 16000, for 50 and 100 iterations, with and without SSE.)

The LDPC decoders implemented on the x86 CPU were programmed in the C language and run on a Core2Duo processor T5600 @ 1.83GHz. All the source code was developed under the Microsoft Visual Studio .NET 2003 IDE.

It was experimentally observed that the speedup achieved with the proposed compact representation, with regard to a brute force approach where the H matrix is fully stored in memory, is approximately equal to the ratio between the number of elements in H and the number of edges. For example, for a matrix of size 1024 × 512 with 6 BNs per row this ratio is approximately 180, while for a 4000 × 2000 matrix with 8 edges per row the measured ratio is over 400.

Five matrices with dimensions 999 × 111 (2997 edges), 816 × 408 (4080 edges), 1908 × 212 (7632 edges), 4896 × 2448 (14688 edges) and 4000 × 2000 (16000 edges) are considered. Figure 8 presents the processing times achieved for the five different matrices.

There is a significant speedup when the LDPC decoder uses SSE instructions, as depicted, from the 999 × 111 matrix up to the 1908 × 212 one. Above the latter, the speedup degrades abruptly, causing a reduction in performance related to the cache. In fact, the time savings introduced by the use of SSE instructions seem unable to compensate for the increase in cache misses beyond a certain size of the H matrix. With SSE instructions, four floating-point elements are read/written from/to memory, against only one element per non-SSE instruction. For this reason, cache misses increase when SSE instructions are used. The same program (the LDPC decoder with SSE) was experimentally tested on different CPUs having an L2 cache memory with twice the size of the one used in this work, and the results reinforce our previous statements. With a larger cache, the cache misses decrease and a significant gain appears even for large matrices using SSE.

However, it is important to note that the intensity of computation depends on the number of edges rather than necessarily on the size of H. For a matrix with dimension 999 × 111 and 2997 edges, the obtained speedup is superior to 3.7. For that matrix, the CPU takes approximately 128 ms to decode a codeword performing 100 iterations without SSE instructions, while the version implemented with SSE executes in only 34 ms. For matrices below 14688 edges, the gain obtained using SSE is a consequence of simultaneously decoding four codewords.

5.2 Stream-based SPA on the GPU

The pixel processors of the GPU were used with the data input as textures and with the output data stream taking the usual place of the pixel color components' output in a graphical application. The basic idea underlying the initial approach adopted to compute the SPA on the GPU places the input H matrix into a texture and processes Kernel 1 and Kernel 2 (see figure 6) in the pixel processors. The proposed compact representation also allows the data transfer between the host memory (RAM) and the video memory (VRAM) to be reduced, which is a very important aspect to achieve high performance with GPUs.

Figure 9 depicts the input and output data streams of the program developed for the GPU, and figure 10 shows a source code segment of an efficient program that makes use of SIMD multiply (*) and subtract (−) instructions operating on four floating-point data elements in the same texture (Red, Green, Blue, Alpha). The shader program was developed in OpenGL (GLSL) and runs on the pixel processors. We use the Caravela programming interface, recently developed for stream-based computing on GPUs [14]. The Caravela library is used on the CPU side to manage the execution of the program in the GPU and to write and read data to and from the VRAM, respectively.

Figure 9. FlowModel showing the input and output data streams for LDPC decoding on a GPU using recursive computation. (The original diagram shows input streams carrying the r/q values, pn and a Flag, the HBN and HCN textures, and multiplexers that select between the first and the remaining iterations, producing the output stream with ri/qi, pn and Flag.)

    vec4 qlin0 = vec4(1.0, 1.0, 1.0, 1.0);
    vec4 rmn0 = vec4(1.0, 1.0, 1.0, 1.0);
    vec2 coord0, coordINI;
    ...
    coord0 = texture2D(CaravelaTex0, coord0).yx;
    for(i=0; i< 7; i++)
    {
        qlin0 = texture2D(CaravelaTex3, coord0);
        rmn0 *= (1.0 - (2.0 * qlin0));
        ...
    }

Figure 10. A code segment showing a BN update on a GPU using OpenGL (GLSL) and Caravela.
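To clarify the indirection performed by the shader in figure 10, the following plain C model mimics the dependent texture lookups on the CPU; the arrays stand in for the textures, and all names are illustrative assumptions:

    typedef struct { float x, y; } coord2;

    /* HBN stored as a "texture" of 2D coordinates: each texel holds the
       coordinate of the next edge of the same CN, while q_tex holds the
       current q(1) values. Mirrors the loop of figure 10 for one channel. */
    float cn_product(const coord2 *hbn_tex, const float *q_tex,
                     int width, coord2 c, int degree)
    {
        float rmn = 1.0f;
        for (int i = 0; i < degree; i++) {
            int idx = (int)c.y * width + (int)c.x;  /* texture2D-style address */
            rmn *= 1.0f - 2.0f * q_tex[idx];        /* as in rmn0 *= ...       */
            c = hbn_tex[idx];                       /* follow the circular chain */
        }
        return rmn;
    }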

In this subsection, a different implementation explores GPUs with dedicated architectures for stream-based computing [10]. The GPU used in this work is an 8800 GeForce from NVIDIA, with programmable pixel processors and 768MB of RAM [1].

Figure 11 shows the total execution times obtained when decoding the five previously mentioned matrices, either on a CPU using SSE instructions or on a GPU. The GPU implementation not only uses the compact stream data structure developed, but also takes advantage of the parallel processing capabilities provided by the pixel processors. As before, the GPU also uses SIMD instructions (GLSL), simultaneously decoding different codewords in parallel.

Figure 11. Decoding processing time for an 8800 GTX GPU versus a CPU with SSE for the same five matrices used in figure 8. (The original plot shows the decoding time in ms versus the number of edges, from 2997 to 16000, for 50 and 100 iterations.)

Results show that the speedup increases as the number of edges being processed increases. Moreover, beyond a certain number of edges, a major percentage of the time is spent in the memory data transfer between the host RAM and the VRAM. For the matrices with 2997 and 4080 edges, the GPU processing power is not able to compensate for the delay introduced by the data transfer between the host and the VRAM. As the size of the matrices increases, the processing power gains more importance in the overall execution time. This is the main reason why a better speedup is obtained for the largest matrix (16000 edges), where the GPU executes 100 iterations in less than 109 ms against almost 300 ms on the CPU with SSE instructions.

Comparing the two SPA streaming implementations, CPU with SSE versus GPU, it is possible to conclude that the GPU performs better for huge quantities of data, due to its parallelism features and tremendous processing power.

6. Conclusions

This paper proposes an LDPC decoder using stream-based architectures to perform the very intensive computation that such codes demand. Pursuing this goal, dynamic and vectorized compact data structures were created to represent all the messages exchanged between the Tanner graph nodes. These data structures allow a significant reduction of the memory space and of the processing time necessary for LDPC decoding. The Sum Product Algorithm was adapted to stream computing suiting different hardware platforms, which improved its efficiency. The proposed data structures and algorithms pave the way for future work in the area of stream-based applications dedicated to the intensive decoding of error correcting codes, by exploiting low cost and highly flexible alternatives to the dominating dedicated VLSI LDPC decoders.

References

[1] NVIDIA GeForce 8800 product page. http://www.nvidia.com/page/geforce_8800.html.
[2] Digital video broadcasting (DVB); second generation framing structure, channel coding and modulation systems for broadcasting, interactive services, news gathering and other broad-band satellite applications. Technical Report EN 302 307 V1.1.1, European Telecommunications Standards Institute (ETSI), 2005.
[3] G. Falcão, M. Gomes, J. Gonçalves, P. Faia, and V. Silva. HDL Library of Processing Units for an Automatic LDPC Decoder Design. In IEEE PhD. Research in Microelectronics and Electronics (PRIME), pages 349-352, Jun 2006.
[4] R. G. Gallager. Low-Density Parity-Check Codes. IRE Transactions on Information Theory, 8(1):21-28, Jan 1962.
[5] J. Gummaraju and M. Rosenblum. Stream Programming on General-Purpose Processors. In 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 343-354, Nov 2005.
[6] H. Hofstee. Power Efficient Processor Architecture and the Cell Processor. In 11th International Symposium on High-Performance Computer Architecture (HPCA), pages 258-262, Feb 2005.
[7] E. Lee and D. Messerschmitt. Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. IEEE Transactions on Computers, 36(1):24-35, Jan 1987.
[8] S. Lin and D. J. Costello. Error Control Coding. Prentice Hall, second edition, 2004.
[9] D. J. C. Mackay and R. M. Neal. Near Shannon Limit Performance of Low Density Parity Check Codes. IEE Electronics Letters, 32(18):1645-1646, 1996.
[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. In Eurographics 2005, pages 21-51, Aug 2005.
[11] A. Peleg and U. Weiser. MMX Technology Extension to the Intel Architecture. IEEE Micro, 16(4):42-50, Aug 1996.
[12] S. Raman, V. Pentkovski, and J. Keshava. Implementing Streaming SIMD Extensions on the Pentium III Processor. IEEE Micro, 20(4):47-57, Jul-Aug 2000.
[13] R. Tanner. A Recursive Approach to Low Complexity Codes. IEEE Transactions on Information Theory, 27(5):533-547, Sep 1981.
[14] S. Yamagiwa and L. Sousa. Caravela: A Novel Stream-Based Distributed Computing Environment. IEEE Computer, 40(5):70-77, May 2007.
[15] T. Zhang and K. Parhi. Joint (3,k)-Regular LDPC Code and Decoder/Encoder Design. IEEE Transactions on Signal Processing, 52(4):1065-1079, Apr 2004.
