FPGA Implementation of LDPC Decoder Architecture For Wireless Communication Standards

FPGA Implementation of LDPC Decoder
Architecture for Wireless Communication Standards

Ruslan Goriushkin, Pavel Nikishkin, Evgeny Likhobabin
and Vladimir Vityazev, Member, IEEE
Department of Telecommunications and foundations of radio engineering
Ryazan State Radio Engineering University, RSREU
Ryazan, Russia
[email protected], [email protected], [email protected], [email protected]
Abstract—This paper presents a decoder design for Quasi- In this paper we provide the parallel decoder implemen-
Cyclic (QC) Low-Density Parity-Check (LDPC) codes. The design tation for QC-LDPC codes, which meets the requirements of
is parameterized and can be easily rebuilt to support various modern specifications for throughput, and has the flexibility to
LDPC Parity-Check matrices taken from the WiMAX (IEEE
802.16e) and the WiFi (IEEE 802.11n) standards. New techniques work with various code rates and lengths. The proposed LDPC
such as parallelization of the decoding architecture cores are decoder can operate with matrices similar to IEEE 802.11 and
proposed. These cores calculate variable-to-check (VTC) and IEEE 802.16e. Most of the previously proposed architectures
new check-to-variable (CTV) messages and also update posterior do not pay much attention to the way the sparse parity check
probabilities (APPs). The parallel multicore decoding architec- matrix is stored. So, the new way of storing the parity check
ture implies a prior shift of values based on the LDPC matrix and
simultaneous calculation of values for the core. Our decoder is matrix is proposed. It allows to reduce the amount of memory
implemented on FPGAs of the Zynq-7000 Mini-ITX Evaluation needed to store the code matrix.
Board (XC7Z100-2FFG900). The throughput of up to 1,2 GBit/s Moreover, the details and results of implementing the above
and the operation frequency of up to 240 MHz have been mentioned architecture on the FPGA part of the Zynq-7000
achieved. Mini-ITX Evaluation Board (XC7Z100-2FFG900) are pro-
Index Terms—Communication systems, Quasi-Cyclic codes,
Low-Density Parity-Check codes, error correction codes, parallel vided.
decoder architecture, iterative decoding, min-sum (MS) algo- The paper is organized as follows: the next section presents
rithm, FPGA implementation. the overview of the decoding algorithm used herein. The pro-
posed FPGA architecture of the LDPC decoder is described in
I. I NTRODUCTION section III. The hardware implementation results are presented
in section IV. Finally, Section V concludes this paper.
The Low-density parity-check (LDPC) codes concept was
developed by Gallager in 1963 [1]. LDPC codes had been for- II. M IN -S UM D ECODING A LGORITHM
gotten until the Gallager’s work was rediscovered by MacKay Low-Density Parity-Check (LDPC) codes are a class of
and Neal [2] in late 1990’s. Currently, there is a high interest in linear block codes. As the name implies, they exhibit the
LDPC codes due to their excellent error correcting capability linearity property and are applied to source data blocks.
and their performance able to achieve up to 0.0045 dB of Any LDPC code can be represented as a check matrix which
the Shannon’s limit at a BER of 10−6 [3]. These codes have contains a small number of ones and all other elements are
become part of important communication standards such as filled with zeros. A regular (N, K) LDPC code is represented
IEEE 802.11n/ad/ay/ax (WiFi) [4], IEEE 802.16e (WiMAX) as a matrix H which includes M ≥ N − K rows and N
[5], and ETSI 5G [6]. They are mentioned in digital video columns. The M rows of H are defined as check nodes, and
broadcasting standards such as DVB-T2, DVB-S2, and DVB- all the elements of a LDPC codeword as variable nodes.
C2. Nowadays LDPC codes also find place in other fields, The min-sum algorithm (MSA) is a simplified Belief prop-
including error correction techniques for solid state drives agation algorithm (BP) built by reducing complexity of the
(SSD). unit responsible for check node updates (CNU). The following
A number of works devoted to parallel LDPC decoding definitions are given for the MSA algorithm: cn indicates
architectures have been published in the last decade [7] - [10]. the n-th bit of a codeword and yn is a value received from
Only paper [8] presents a universal parametrizable decoder. the channel. The check-to-variable (CTV) message rmn [k]
The decoders proposed in [7], [9], [10] are based only on one is a message from the check node m to the variable node
communication standard or one codeword length. The authors n during the k-th iteration. The variable-to-check message
of the papers widely use BRAMs to store the decoder internal (VTC) qmn [k] is a message from the variable node n to the
values. check node m. The set of variables involved in checking m is
This work was supported by Russian Science Foundation under Grant 17- defined as N (m) and the set of variable n checks is defined
79-20302. as M (n). The set N (m) without the variable n is denoted as
Figure 1. FPGA LDPC decoder structure
N (m)/n and the set M (n) without the check m is denoted III. P ROPOSED D ECODER A RCHITECTURE
as M (n)/m [11]. A. Decoder architecture definitions
The algorithm implementation includes the following steps:
1) Initialization: The general structure of the proposed LDPC decoder is
At this step, the channel probability pn (intrinsic informa- shown in Figure 1.
tion) of the variable node n is calculated based on a priori The soft decisions from the demodulator are used as input
probability: to the decoder. The input serial bitstream is converted into a
parallel and is written to the block memory. For ease of parallel
P (yn |cn = 0) APP reading and writing, the block memory is divided into
pn = log (1)
P (yn |cn = 1) K = N/z blocks. The number of these blocks is determined
The value rmn is initially set to zero. by the length of the codeword (N ) and the size of the circulant
2) Iterative Decoding: (z). The size of each block depends on the number of symbols
The variable-to-check message qmn [k] is calculated at each in the circulant and the width of the input data.
iteration as follows: The Iterations counter block makes a decision on comple-
X tion of the decoding process and output of the result. When
qmn [k] = pn + rm0 n [k − 1] (2) a given number of iterations is reached, its output is set to a
m0 ∈{M (n)/m} high level.
The LDPC decoder makes a hard decision by calculating The Operating mode selection block is used to switch
the a-posterior probability (APP) as follows: the operating mode to ”decoding” or ”result output” and to
organize output of the decoding result. When the high level
is received at the input from the Iterations counter block, the
X
Λn [k] = pn + rm0 n [k − 1] (3)
m0 ∈M (n)
“result output” mode is switched on: a high level is set at
the READY output, and the decoded word is obtained at the
The value of the n-th result bit of the codeword xn is equal
output. Otherwise, the next iteration is to be run.
to 0 when Λn > 0 and xn = 1 otherwise. The decoding
The VTC calculation block is used to calculate the variable-
process is completed when the entire codeword x = [x1 , x2 , · ·
to-check messages based on APPs from the BRAM APP
· · · · xN ] satisfies the parity check equations: Hx = 0, or the
block. A simplified expression for calculating variable-to-
preset maximum number of iterations is reached.
check messages is shown in expression (5).
If the decoding process does not stop, then the check-to-
variable message rmn [k] for the check node m is calculated V T Cn = AP Pn − CT Vn , n = 1..dr , (5)
as follows:
where dr is the matrix row weight.
  A check matrix, which is stored in the distributed memory,
Y is needed to retrieve characters from an APP correctly and to
rmn [k] =  sign(qmn0 [k])
calculate bit-to-variable messages.
n0 ∈{N (m)/n}
The Min/submin calculation block receives dr signed VTC
× α× min {|qmn0 [k]|} (4) values including the minimum and subminimal values at the
n0 ∈{N (m)/n} input. The dr value is determined by the check matrix.
Compared to the BP algorithm, usage of the min-sum The diagram shown in Figure 2 has eight signed data inputs.
algorithm reduces consumption of hardware resources and the This diagram is a modified version of the algorithm from [12].
computational complexity of CNUs [11]. The maximum matrix row weight is limited by the number of
Figure 2. Min/submin calculation block
inputs, it equals to eight in this case. When the line weight is in Figure 3.
less than eight, the remaining inputs are fed with the maximum To calculate the updated CTV values, the input of the CTV
positive values for the current input bitness. The ABS block is and APP calculation block requires the minimum VTC value,
used to calculate modulo values: the input signed data (1−8) is the product of the VTC signs and the sign for the current
converted to the unsigned data (10 −80 ). Also, the ABS module value. The calculation of the updated CTV is presented by
extracts signs from all inputs and sends them to the signs bus. formula (6).
The collection of all signs is passed to the Signs XORing
block, where modulo 2 addition is performed. The search N
algorithm is a cascade of 4 consecutive stages of comparators Y
CT Vnewn = sign(V T Ck ) × sign(V T Cn )
(Compare), connected as shown in Figure 2. In this case the
k=1
maximum row weight is 7. ×min(|V T C|), (6)
The subminimal value is obtained at the output corre-
sponding to the input with the minimum modulo value. The where n is from 1 to dr .
minimum value is obtained at the other outputs. All outputs The updated APPs are obtained by summing the VTCs and
are signed and have the sign obtained from the Signs XORing CTVs, as shown by the formula (7).
block. The VTC input is also passed to the output.
AP Pnewn = V T Cn + CT Vnewn , (7)
where n is from 1 to dr .
B. Storage of the parity-check matrix, the CTV node and the

APP in memory
A H matrix of the QC-LDPC codes taken from the WiFi and
WiMAX standards is quasi-cyclic, so it consists of submatrices
generated by circular shifting of an identity matrix. Only base
parity-check matrixes Hb are stored for QC-LDPC decoder
implementations. It allows to reduce memory consumption for
matrix set storage significantly. The parity check matrices Hb
used herein are specified in [5]. To further reduce the memory
usage and to ease reading of matrix elements, they are stored
in three vectors inside the proposed decoder.
Figure 3. Check node unit calculation block The vectors for storing the matrix are shown below:
• vals is a vector of all matrix elements which is other
The Sign insertion block is designed to add the sign obtained than -1.
at the output of the Signs XORing block to each output of the • pos is a vector containing the matrix column index of
min/submin search diagram. The addition principle is shown each element from vals;
• elementSize is a vector whose element has a value The maximum throughput is achieved when the codeword
containing the number of row elements that are other size is 2304 at the frequency of 192 MHz. The results of
than -1. comparing the proposed decoder with similar ones are shown
Based on the proposed decoder architecture, all the APPs in Table II. The Table II shows that in contrast to [8] the
used by the current row of the parity check matrix are proposed architecture allows to increase the throughput by
updated simultaneously, i.e. the number of cores is equal to increasing the codeword size. It also allows to work with a
the codeword circulant size (z). Therefore, z memory in- wider range of code lengths than [9] and has higher flexibility
stances with the size of (rowW eightP arityCheckM atrix× and throughput compared to [10].
inputDataSize × numberOf RowP arityCheckM atrix)
are needed to store CTVs. TABLE II. Comparison of FPGA Implementations of LDPC
The APP memory uses the dual-port block memory with the Decoders
size of (z×weigth of input data)×(length codeword/z). [8] [9] [10] present work
In the codeword decoding mode, only a part of the blocks is Block Length 576-2304 648-1944 1296 576-2304
read from the BRAM depending on the LDPC matrix. When Frequency, MHz - 200 170 288
the decoding process is completed, all blocks of the block
FPGA Technology Altera Xilinx Xilinx Xilinx
memory are transferred to the decoder output.
Stratix IV Kintex 7 Virtex 7 Zynq-7000
IV. I MPLEMENTATION RESULTS Iterations 10 - - 10
The implementation results show that the decoder can Throughput, Mbps 218,5-68,1 420 102 48,3-140,8
operate with any LDPC matrices of the WiMAX (IEEE
802.16e) and WiFi (IEEE 802.11n) protocols. The decoder
R EFERENCES
can obtain the maximum decoding throughput of 1.2 Gpbs for
one iteration (the codeword length is equal to 2304). [1] R. G. Gallager, ”Low-density parity-check codes”, Cambridge, MA:
M.I.T. Press, 1963.
Table I shows the WiMAX decoder implementation results [2] D.J.C. MacKay, R.M. Neal, ”Near Shannon limit performance of low
such as the number of used LUTs, registers and RAM blocks density parity check codes”, Electron. Lett., vol. 32, no. 18, pp. 1645-
as well as the maximum frequency. The decoding throughput 1646, Aug. 1996.
[3] S. Y. Chung, G. D. Forney, T. J. Richardson, and R. L. Urbanke, “On
is shown for one iteration. the design of low-density parity check codes within 0.0045 dB of the
Shannon limit,” IEEE Commun. Lett., vol. 5, pp. 58–60, Feb. 2001.
TABLE I. Synthesis results for IEEE 802.16e [4] 802.11n-2009 - IEEE Standard for Information technology– Local and
metropolitan area networks– Specific requirements– Part 11: Wireless
Slice Slice Clock Throughput LAN Medium Access Control (MAC) and Physical Layer (PHY) Spec-
n z LUTs Registers BRAM ( MHz ) ( Mbps ) ifications Amendment 5: Enhancements for Higher Throughput, 29 Oct.
2009, DOI: 10.1109/IEEESTD.2009.5307322
576 24 80260 37113 216 240 483 [5] 802.16e-2005 - IEEE Standard for Local and Metropolitan Area Net-
960 40 108761 51964 248 214 703 works - Part 16: Air Interface for Fixed and Mobile Broadband Wireless
Access Systems - Amendment for Physical and Medium Access Control
1440 60 139261 69972 288 189 907
Layers for Combined Fixed and Mobile Operation in Licensed Bands,
1920 80 159265 88463 328 195 1219 28 Feb. 2006, DOI: 10.1109/IEEESTD.2006.99107
2304 96 188147 102877 360 192 1408 [6] 3GPP TS 38.212 version 15.2.0 Release 15 - 3GPP Standard 5G; NR;
Multiplexing and channel coding, Jul. 2018
[7] C. Kun; S. Qi; L. Shengkai; P. Chengzhi Implementation of encoder
At the same time the proposed design may be used to syn- and decoder for LDPC codes based on FPGA, Journal of Systems
Engineering and Electronics ( Volume: 30 , Issue: 4 , Aug. 2019 ),
thesize decoders not only for IEEE 802.16e and IEEE802.11n 642 – 650 pp., 02 September 2019, DOI: 10.21629/JSEE.2019.04.02
standards, but for any QC-LDPC code. [8] P. Hailes; L. Xu; R.G. Maunder; B.M. Al-Hashimi; L. Hanzo A
Flexible FPGA-Based Quasi-Cyclic LDPC Decoder, IEEE Access (
V. C ONCLUSION Volume: 5 ), 20965 – 20984 pps, 03 March 2017, DOI: 10.1109/AC-
CESS.2017.2678103
A parametrisable architecture of a QC-LDPC decoder was [9] S. Mhaske; D. Uliana; H. Kee; T. Ly; A. Aziz; P. Spasojevic A
proposed herein. The parallel decoding architecture was im- 2.48Gb/s FPGA-based QC-LDPC decoder: An algorithmic compiler
plemented using the Zynq-7000 Mini-ITX Evaluation Board implementation, 2015 36th IEEE Sarnoff Symposium, 20-22 Sept. 2015,
DOI: 10.1109/SARNOF.2015.7324649
(XC7Z100-2FFG900). [10] S. Nimara, O. Boncalo, A. Amaricai, M. Popa, FPGA Architecture
The decoding throughput (T) of the LDPC decoder can be of Multi-Codeword LDPC Decoder With Efficient BRAM Utilization,
evaluated as 2016 IEEE 19th International Symposium on Design and Diagnostics
of Electronic Circuits Systems (DDECS), 20-22 April 2016, DOI:
Fmax × BL 10.1109/DDECS.2016.7482452
T = (8) [11] K. Zhang, X. Huang, Z. Wang, ”High-Throughput Layered De-
CP I × N OI coder Implementation for Quasi-Cyclic LDPC Codes”, 2009, IEEE
where Fmax is the maximum clock frequency, BL is the block Journal on Selected Areas in Communications. 27. 985-994. DOI:
10.1109/JSAC.2009.090816.
length, CP I is the number of cycles per iteration and N OI [12] R. Zarubica, S. G. Wilson and E. Hall, ”Multi-Gbps FPGA-Based Low
is the number of iterations. Density Parity Check (LDPC) Decoder Design,” IEEE GLOBECOM
The decoder implementation results are shown for different 2007 - IEEE Global Telecommunications Conference, Washington, DC,
USA, 2007, pp. 548-552, doi: 10.1109/GLOCOM.2007.108.
matrices in Table I.

FPGA Implementation of LDPC Decoder Architecture For Wireless Communication Standards

Uploaded by

Copyright:

Available Formats

FPGA Implementation of LDPC Decoder Architecture For Wireless Communication Standards

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

FPGA Implementation of LDPC Decoder Architecture For Wireless Communication Standards

Uploaded by

Copyright:

Available Formats

FPGA Implementation of LDPC Decoder

Architecture for Wireless Communication Standards

AP Pnewn = V T Cn + CT Vnewn , (7)

B. Storage of the parity-check matrix, the CTV node and the

You might also like