This paper is organized as follows. The next section addresses the aspects of the SPA algorithm that are important to understand the proposed stream-based LDPC decoders. The following section mentions properties of the stream computing model, while section 4 presents the data structures and the SPA flooding schedule specifically developed for stream-based computing. Section 5 reports experimental results obtained for SPA stream LDPC decoding on stream-based processors, such as general purpose processors with multimedia extensions and graphics processing units (GPUs). Finally, section 6 concludes the paper.

Given an (n, k) LDPC code, we assume BPSK modulation, which maps a codeword c = (c_1, c_2, ..., c_n) into the sequence x = (x_1, x_2, ..., x_n), according to x_i = (-1)^c_i. Then, x is transmitted through an additive white Gaussian noise (AWGN) channel, originating the received sequence y = (y_1, y_2, ..., y_n), with y_i = x_i + n_i, and n_i being a random variable with zero mean and variance sigma^2 = N_0/2. The SPA is depicted in Algorithm 1 [8] and is mainly described by two intensive processing blocks, horizontal and vertical, defined respectively by equations (1), (2) and (3), (4). The former two equations calculate the message update from each CN_m to BN_n, considering accesses to H in a row basis; they indicate the probability of BN_n being 0 or 1:

    r_mn^(i)(0) = 1/2 + 1/2 * prod_{n' in N(m)\n} ( 1 - 2 q_n'm^(i-1)(1) )    (1)

[Figure 1 residue: bit nodes BN0 to BN7; Kernel 1 - Horizontal Processing.]

Figure 1 exemplifies, for a particular 4 x 8 H matrix, BN_0 being updated by CN_0 and BN_4 updated by CN_1. It also shows BN_6 and BN_7 being updated, respectively, by CN_2 and CN_3. For each iteration, r_mn^(i) values are updated by equations (1) and (2) according to the edges defined in the Tanner graph. For the example in figure 1, r_00 = f(q_10, q_20) and r_26 = f(q_02, q_32).

Similarly, the second major processing block, described by equations (3) and (4), computes the messages sent from BN_n to CN_m, assuming accesses to H in a column basis. In this case, q_nm^(i) values hold the update defined by equations (3) and (4) for the edge connectivity indicated by the Tanner graph. Again, in figure 1 it can be seen that q_00 = f(r_20) and q_41 = f(r_34). Thus, equations (1) to (4) define very irregular memory access patterns.

The iterative procedure is stopped if the decoded word c_hat verifies all parity check equations, i.e. c_hat H^T = 0, or when a maximum number of iterations is reached.

... arrays before gathering/scattering the data. Moreover, some of them are iterative and require dynamic adjustment to be programmed in a stream-based style. This is the case of the LDPC decoder based on the SPA algorithm described in the next sections.

4 Stream-based LDPC decoding

The H matrix defines the Tanner graph and, thus, the flow of messages exchanged between CNs and BNs. In medium to large sized LDPC codes, a huge amount of memory is required to store this matrix. Moreover, in a stream computing model this matrix has to be itself gathered into a data stream. Therefore, we propose to separately code the information about H in two independent data streams suited to the SPA. These auxiliary data streams establish the order of the data in the input stream used in the flow of messages to update the BNs and the CNs. A stream-based SPA for implementing the LDPC decoder is represented in the SDF of figure 2.
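The stopping rule stated above only needs the check c_hat H^T = 0, and it can be evaluated directly on the sparse edge representation of H rather than on a dense matrix. The sketch below is a minimal illustration in plain C; the names edge_row, edge_col and parity_checks_satisfied are assumptions for this example and not the paper's code.

#include <stdint.h>
#include <stdlib.h>

/* Returns 1 when c_hat satisfies every parity check (c_hat * H^T = 0).
 * edge_row[e] / edge_col[e] give the CN index m and BN index n of edge e. */
int parity_checks_satisfied(const uint8_t *c_hat,
                            const int *edge_row, const int *edge_col,
                            int num_edges, int num_checks)
{
    uint8_t *syndrome = calloc((size_t)num_checks, sizeof *syndrome);
    if (!syndrome)
        return 0;

    /* Each check node accumulates, modulo 2, the hard decisions of the BNs
     * connected to it by a non-null element of H. */
    for (int e = 0; e < num_edges; e++)
        syndrome[edge_row[e]] ^= (uint8_t)(c_hat[edge_col[e]] & 1u);

    int ok = 1;
    for (int m = 0; m < num_checks; m++)
        ok &= (syndrome[m] == 0);

    free(syndrome);
    return ok;   /* 1: stop iterating; 0: continue decoding */
}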
Figure 3. HBN structure. A row based data stream representing BN edges with circular addressing in 2D shape.

Figure 4. HCN structure. A column based data stream representing CN edges with circular addressing in 2D shape.
... output data stream feeds the input stream of the next processing kernel in a pipeline organization.

The codeword size (dimension) and rate k/n (indicating the ratio between parity and message information) imply matrices with quite different sizes. If the quantity of "1's" is constant in all rows, and if the same rule applies to the columns, the code is said to be regular; otherwise, it is irregular. The latter type of code is known to have better coding performance than the former, but is more difficult to implement [8]. The stream data representation and algorithm proposed in this paper support both regular and irregular codes. Compact and efficient dynamic and vectorized data structures, HBN and HCN, are proposed to represent the message updating between connected nodes (BN_n, CN_m). Such structures require significantly less memory, since H is sparse, and are suitable for stream-based computing.

The example of figure 1 can be used to describe the transformation performed on H in order to produce the compact data stream structures HBN and HCN. HBN codes information about the edge connections used in each parity check equation (horizontal processing). This data structure is generated by scanning the H matrix in row major order and by sequentially mapping only the BN edges associated with a single CN equation (non-null elements in a single row of H). Each element of the data structure records the address of the next entry, as a pointer, and the corresponding r_mn value. The element corresponding to the last non-null element of each row in H points to the first element of that row, implementing a circular list that is used to update all the BNs connected to a single CN. Thus, a CN always knows from which BNs it should read information and where to update, as well. Circular addressing allows a high level of parallelism to be exploited. In the limit, for a multi-processor platform, a different processor can be allocated to each single edge, since the starting point is no longer important, due to the circular addressing feature. Although in figure 3 the elements of the structures are represented by their row and column addresses, the structures can easily be vectorized by considering the rows of the structure in sequence, using convenient 1D or 2D reshaping according to the target stream-based architecture they apply to.

The same principle can be applied to code the information about CNs. HCN can be defined as a sequential representation of every BN connecting to all its neighboring CNs (non-null elements in a single column of H). Once again, the access between adjacent elements is circular, as illustrated in figure 4 for the H matrix given in figure 1. In this case, a careful construction of the addresses in HCN is required, because every (n,m) edge must be represented in exactly the same position as it is in HBN. This meticulous positioning of the elements in HCN allows the iterative processing of consecutive inputs to be alternately performed for r_mn^(i) and q_nm^(i), guaranteeing that the data being processed for every (n,m) edge is stored in the correct position for both situations. In fact, the in-place computation performed here for horizontal and vertical processing is supported by an efficient mechanism that swaps data pointers [14] at the end of each iteration, according to the input data streams necessary for the next processing step.

Figures 3 and 4 are the resulting HBN and HCN structures ...
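To make the circular HBN idea concrete, the sketch below (plain C, illustrative only; EdgeEntry, hbn and update_rmn are hypothetical names, not the paper's code) pairs each edge with the index of the next edge of the same check equation and walks that circular list to apply the horizontal update of equation (1). It assumes that equation (2) is simply the complement r_mn(1) = 1 - r_mn(0), as in the standard SPA.

/* Minimal sketch of an HBN entry: next-edge pointer plus message value. */
typedef struct {
    int   next;   /* index of the next entry belonging to the same CN (circular) */
    float r;      /* r_mn(1): message from CN m to BN n                          */
} EdgeEntry;

/* Kernel 1 (horizontal processing) for one edge e, following equation (1):
 * r_mn(0) = 1/2 + 1/2 * prod_{n' in N(m)\n} (1 - 2*q_{n'm}(1)).
 * q[] holds q_{n'm}(1) for every edge, stored in the same positions as HBN,
 * which is exactly the HBN/HCN position consistency described above. */
static void update_rmn(EdgeEntry *hbn, const float *q, int e)
{
    float prod = 1.0f;

    /* Walk the circular list starting after e and stop when back at e,
     * so the edge being updated is excluded from the product. */
    for (int k = hbn[e].next; k != e; k = hbn[k].next)
        prod *= 1.0f - 2.0f * q[k];

    float r0 = 0.5f + 0.5f * prod;   /* probability of BN n being 0        */
    hbn[e].r = 1.0f - r0;            /* assuming eq. (2) is the complement */
}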
[Figures 5 and 6 residue: HBN and HCN input streams, the initial stream ps0, the rs_i and qs_i streams, and the Kernel 1 / Kernel 2 pipeline processing several codewords (word 1, word 2, ..., word n).]
The flooding schedule [8] algorithm used controls the order in which BNs and CNs are updated. It guarantees that no BN is updated before all the other BNs conclude their updating procedure. This principle is adopted in this work because it allows true parallel execution of the SPA decoder.

The circular addressing property permits parallel processors to simultaneously access the necessary information stored in both HBN and HCN. Consequently, the overall processing time required to decode a codeword can be significantly reduced by applying parallel processing. Figure 5 represents a detail of HBN illustrating such efficient properties for the update of r_mn^(i) values.

An alternative representation of the SDF in figure 2 is presented in figure 6. This compact SDF is obtained by folding the pair of kernels (Kernel 1 and Kernel 2) and by introducing an additional multiplexer. Kernel 1 initially processes the input stream ps0 according to the BN indexes defined in the input stream HBN. It produces the output stream rs0 by performing the message update from CN_m to BN_n. The stream rs0 directly feeds the input of Kernel 2, which in turn produces the stream qs0 that represents the messages updated from BN_n to CN_m. The other input of Kernel 2 is HCN, which defines the CN indexes from H. Only data streams communicate between kernels. Figure 6 shows that, by using the multiplexer at the input of the SDF, the output stream qs_i becomes the input stream of Kernel 1 for all remaining iterations after the first one. At the end, the qs_i output stream conveys the codeword.

This compact representation can be exploited to implement efficient LDPC decoders, not only because of the new compact stream oriented representations of the H matrix, but also regarding the program memory and the data communication. The required program memory is reduced by the folding mechanism, and the data streams can be multiplexed through a simple and efficient mechanism for swapping pointers.

Additionally, the SIMD processing capabilities of stream-based architectures can be applied to Kernel 1 and Kernel 2 by simultaneously decoding several codewords (see figure 5). Multimedia extensions on Central Processing Units (CPUs) and the multiple pixel processors on Graphics Processing Units (GPUs) are examples of the SIMD computation machines explored in this paper.

4.3 Memory Requirements and Computational Complexity

If a good error correcting capability is required (e.g., with a Bit Error Rate in the order of 10^-8 or lower), large matrices should be considered [8]. Although they are sparse, storing such information may require a considerable amount of memory.

With the proposed compact data structure, the necessary memory can be significantly reduced. In fact, let us consider a standard matrix used in DVB-S2 [2] with n = 64800 codeword bits (columns), (n - k) = 16400 parity check equations (rows), a rate of 3/4 and a total number of non-null elements in H (numEdges) equal to 194400. The number of elements stored can be reduced from 106.272 x 10^7 to only 194400, which leads to a Storage Compression Ratio (SCR), given by (5), of 2733, less than 0.04% of the original necessary memory space.

    SCR = ( n x (n - k) ) / ( numEdges x 2 )    (5)

At the same time, the overall processing time needed to access all matrix elements is reduced, improving the computational efficiency of the decoder. Considering only load/store operations, the computational complexity of the SPA algorithm with full H storage decreases from O(n x (n - k)) to O(numEdges), i.e. from quadratic to linear complexity in the number of edges.
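As a quick sanity check of (5) and of the numbers quoted above, the snippet below (plain C, illustrative only) reproduces the SCR of 2733 and the sub-0.04% storage figure for the DVB-S2 example.

/* Illustrative check of equation (5) for the DVB-S2 example quoted above. */
#include <stdio.h>

int main(void)
{
    double n = 64800.0, parity = 16400.0, num_edges = 194400.0;

    double dense   = n * parity;        /* full H: about 106.272 x 10^7 elements  */
    double compact = num_edges * 2.0;   /* HBN plus HCN: one entry per edge each  */

    printf("SCR = %.0f\n", dense / compact);                    /* prints 2733    */
    printf("compact/dense = %.3f%%\n", 100.0 * compact / dense); /* about 0.037%  */
    return 0;
}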
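Tying the folded SDF of figure 6 to code, the sketch below is a minimal host-side iteration loop; kernel1_horizontal, kernel2_vertical and syndrome_is_zero are hypothetical names and the whole fragment is an assumption about structure, not the paper's implementation. The multiplexer behaviour is captured by an input pointer that starts at ps0 and is simply re-pointed at the qs stream after each pass, so no stream data is copied between iterations, in the spirit of the pointer-swapping mechanism described above.

/* Hypothetical kernel signatures; bodies omitted for brevity. */
void kernel1_horizontal(const float *in, const int *hbn, float *rs, int num_edges);
void kernel2_vertical(const float *rs, const int *hcn, float *qs, int num_edges);
int  syndrome_is_zero(const float *qs);   /* hard-decide and test c_hat * H^T = 0 */

void decode(const float *ps0, float *rs, float *qs,
            const int *hbn, const int *hcn, int num_edges, int max_iter)
{
    const float *in = ps0;                           /* MUX: channel data on the first pass */

    for (int it = 0; it < max_iter; it++) {
        kernel1_horizontal(in, hbn, rs, num_edges);  /* CN -> BN messages, eqs. (1), (2)    */
        kernel2_vertical(rs, hcn, qs, num_edges);    /* BN -> CN messages, eqs. (3), (4)    */

        in = qs;                                     /* qs feeds Kernel 1 from now on       */
        if (syndrome_is_zero(qs))                    /* stop once all parity checks hold    */
            break;
    }
    /* At the end, the qs stream conveys the decoded codeword. */
}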
[Figure residue: an x86/SSE assembly fragment (test, mov, movaps, jle, npad) and a decoding-time plot with legend entries 50 iterations CPU, 50 iterations CPU SSE, 100 iterations CPU, 100 iterations CPU SSE.]
vec4 qlin0 = vec4(1.0, 1.0, 1.0, 1.0);
vec4 rmn0 = vec4(1.0, 1.0, 1.0, 1.0);
vec2 coord0, coordINI;
...
coord0 = texture2D(CaravelaTex0, coord0).yx;
...
for(i=0; i< 7; i++)
{
    qlin0 = texture2D(CaravelaTex3, coord0);
    rmn0 *= (1.0 - (2.0 * qlin0));
    ...
}
...
if(Flag)
{
    //BNs update
    ...
}
else
{
    //CNs update
    ...
}

Figure 10. A code segment showing a BN update on a GPU using OpenGL (GLSL) and Caravela.

[Figure 9 residue: FlowModel diagram with INPUT STREAM 0 (r0/q0, pn, Flag), INPUT STREAM 1 (HBN.x, HBN.y, weight), INPUT STREAM 2 (HCN.x, HCN.y, weight), a multiplexer selected on the first iteration, and OUTPUT STREAM 0 (ri/qi, pn, Flag).]
Figure 9. Flowmodel showing input and output data streams for LDPC decoding on a GPU using recursive computation.

... when the LDPC decoder uses SSE instructions, as depicted, from matrix 999 x 111 up to 1908 x 212. Above the latter one, the speedup degrades abruptly, causing a reduction in performance related to the cache. In fact, the time savings introduced by the use of SSE instructions seem unable to compensate for the increase in cache misses beyond a certain size of the H matrix. With SSE instructions, four floating-point elements are read/written from/to memory, against only one element per non-SSE instruction. For this reason, cache misses increase when SSE instructions are used. The same program (LDPC decoder with SSE) was experimentally tested on different CPUs having L2 cache memory with twice the size of the one used in this work, and the results reinforce our previous statements. With a larger cache, the cache misses decrease and a significant gain appears even for large matrices using SSE.

However, it is important to note that the intensity of computation depends on the number of edges rather than necessarily on the size of H. For a matrix with dimension 999 x 111 and 2997 edges, the obtained speedup exceeds 3.7. For that matrix, the CPU takes approximately 128ms to decode a codeword performing 100 iterations without SSE instructions, while the version implemented with SSE executes in only 34ms. For matrices below 14688 edges, the gain obtained using SSE is a consequence of simultaneously decoding four codewords.

5.2 Stream-based SPA on the GPU

The pixel processors of the GPU were used with data input as textures, and with the output data stream assuming the usual place of the pixel color components' output in a graphical application. The basic idea underlying the initial approach adopted to compute the SPA on the GPU places the input H matrix into a texture and processes Kernel 1 and Kernel 2 (see figure 6) in the pixel processors. The proposed compact representation also reduces the data transfer between the host memory (RAM) and the video memory (VRAM), which is a very important aspect to achieve high performance with GPUs.

Figure 9 depicts the input and output data streams for the program developed for the GPU, and figure 10 shows a source code segment of an efficient program that makes use of SIMD multiply (*) and subtract (-) instructions operating on four floating-point data elements in the same texture (Red, Green, Blue, Alpha). The shader program was developed in OpenGL (GLSL) and runs on the pixel processors. We use the Caravela programming interface recently developed for stream-based computing on GPUs [14]. The Caravela library is used on the CPU side to manage the execution of the program on the GPU and to write and read data to and from the VRAM, respectively.

In this subsection, a different implementation explores GPUs with dedicated architectures for stream-based computing [10]. The GPU used in this work is a GeForce 8800 from NVIDIA with programmable pixel processors and 768MB of RAM [1].

Figure 11 shows the total execution times obtained when decoding the five previously mentioned matrices, either on a CPU using SSE instructions, or on a GPU. The GPU implementation not only uses the compact stream data structure developed, but also takes advantage of the parallel processing ...
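Referring back to the CPU-SSE version discussed in 5.1 above, the sketch below (C with SSE intrinsics; the data layout and function name are assumptions, not the paper's code) illustrates how the (1 - 2q) product of equation (1) can operate on one edge of four codewords per instruction, which is the "simultaneously decoding four codewords" effect mentioned above.

/* Illustrative only: one 128-bit register holds the q value of the same edge
 * for four different codewords (assumed layout q[4*e + w], w = 0..3). */
#include <xmmintrin.h>

static __m128 rmn_product(const float *q, const int *next, int e)
{
    const __m128 one = _mm_set1_ps(1.0f);
    const __m128 two = _mm_set1_ps(2.0f);
    __m128 prod = one;

    /* Walk the circular HBN list, skipping the edge being updated. */
    for (int k = next[e]; k != e; k = next[k]) {
        __m128 qk = _mm_loadu_ps(&q[4 * k]);               /* edge k, 4 codewords */
        prod = _mm_mul_ps(prod, _mm_sub_ps(one, _mm_mul_ps(two, qk)));
    }
    return prod;   /* caller applies 0.5 + 0.5*prod, as in equation (1) */
}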
[Figure 11 residue: decoding-time plot, y-axis Decoding Time (ms), with legend entries 50 iterations CPU SSE, 50 iterations GPU, 100 iterations CPU SSE.]

... which improved its efficiency. The proposed data structures and algorithms pave the way for future work in the area of stream-based applications dedicated to the intensive decoding of error correcting codes, by exploiting low cost and highly flexible alternatives against dominating hardware ...

References