TCC Decoder v2.1
Product Specification
Features
Drop-in module for Spartan-3, Spartan-3E, Spartan-3A/3AN, Virtex-II, Virtex-II Pro, Virtex-4, and Virtex-5 FPGAs
Implements the CDMA2000/3GPP2 specification [1]
Core contains the full 3GPP2 interleaver
Full range of 3GPP2 block sizes supported (122 to 12282)
Core implements the MAX*, MAX, or MAX SCALE algorithms
Dynamically selectable number of iterations (1 to 15)
Number representation: two's complement fractional numbers
  Data input: 2 to 5 integer bits and 1 to 4 fractional bits
  Internal calculations: 6 to 9 integer bits and 1 to 4 fractional bits
Sliding window size of 32 or 64
Works with all 3GPP2 code rates
Internal or external RAM data storage
To be used with Xilinx CORE Generator system
Provided by Xilinx, Inc.
Instantiation Template Supported Device Family
Licensing
Pay Core. Requires a full or evaluation license
Support
Applications
This version of the Turbo Convolution Code (TCC) decoder is designed to meet the 3GPP2 mobile communication system specification [1].
2006-2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. Xilinx is providing this design, code, or information "as is." By providing the design, code, or information as one possible implementation of this feature, application, or standard, Xilinx makes no representation that this implementation is free from any claims of infringement. You are responsible for obtaining any rights you may require for your implementation. Xilinx expressly disclaims any warranty whatsoever with respect to the adequacy of the implementation, including but not limited to any warranties or representations that this implementation is free from claims of infringement and any implied warranties of merchantability or fitness for a particular purpose.
www.xilinx.com
General Description
The TCC decoder is used in conjunction with a TCC encoder to provide a reliable, extremely effective way to transmit data over noisy data channels. The Turbo Decoder core operates very well under low signal-to-noise conditions and provides performance close to the theoretical optimum defined by the Shannon limit [2].

When a decoding operation is started, the core accepts the block size and the number of iterations from two input ports. The systematic and parity data are read into the core in parallel on a clock-by-clock basis. The core then starts the decoding process and implements the required number of decode iterations. Finally, the decoded bit sequence is output. The entire sequence is automatically controlled from a single first data signal and requires no user intervention. In addition, all the interleaving operations required in the 3GPP2 specification are handled automatically within the core.

The core expects two's complement fractional numbers as inputs and also uses this format for the internal calculations. Each fractional input number represents the Log Likelihood Ratio (LLR) divided by 2 for each input bit. This LLR value can be considered to be the confidence level that a particular bit is a one or zero.

The user can trade off accuracy against speed and complexity by selecting the numerical precision that is required. The input data can have 2 to 5 integer bits and between 1 and 4 fractional bits. The precision of the internal calculations can also be controlled with 6 to 9 integer bits and between 1 and 4 fractional bits. The number of internal integer bits must be greater than the number of input integer bits by 3 or more, and the number of input fractional bits must be less than or equal to the number of internal fractional bits.
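The precision rules above can be checked mechanically before a core is generated. The following Python sketch is illustrative only: the function name and argument names are hypothetical and are not part of the core or of the CORE Generator system.

```python
def check_precision(in_int, in_frac, int_int, int_frac):
    """Return a list of violated precision rules (empty if the choice is valid).

    in_int / in_frac   : integer and fractional bits of the input data
    int_int / int_frac : integer and fractional bits of the internal calculations
    (All four names are hypothetical, chosen for this sketch.)
    """
    problems = []
    if not 2 <= in_int <= 5:
        problems.append("input integer bits must be 2 to 5")
    if not 1 <= in_frac <= 4:
        problems.append("input fractional bits must be 1 to 4")
    if not 6 <= int_int <= 9:
        problems.append("internal integer bits must be 6 to 9")
    if not 1 <= int_frac <= 4:
        problems.append("internal fractional bits must be 1 to 4")
    if int_int < in_int + 3:
        problems.append("internal integer bits must exceed input integer bits by 3 or more")
    if in_frac > int_frac:
        problems.append("input fractional bits must not exceed internal fractional bits")
    return problems


# Example: 2.3 input format with 6.3 internal format satisfies every rule.
print(check_precision(2, 3, 6, 3))
```

An empty list means the combination satisfies all the stated constraints; any other result names the rule that was broken.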
Algorithm Type
The full TCC decoder algorithm is extremely computationally intensive, so approximations must be made to make the algorithm usable in practice. The approach taken here is to provide the user with three algorithm choices:
1. MAX*. A very good algorithmic approximation used when accuracy, rather than algorithm simplicity, is required. BER performance of this approach is the best of all three algorithms, although this increases core complexity and resource requirements. In this algorithm, a small lookup table is used to increase the accuracy of some non-linear operations.
2. MAX. Produces lower BER performance than the MAX* algorithm, but provides the advantage of being less complex and, therefore, requires fewer resources. In this case, the lookup table is not used, which reduces the algorithm accuracy and subsequently produces a slightly degraded BER performance (approximately 0.5 dB compared to the MAX* algorithm).
3. MAX SCALE. Produces BER performance very close to the MAX* (within approximately 0.1 dB to 0.2 dB) but with the complexity of the MAX algorithm. If the small reduction in BER performance is acceptable, this provides the best BER performance/resource requirement trade-off. Reference [3], Improving the MAX Log MAP Turbo Decoder, describes this approach in greater detail.
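The difference between MAX* and MAX can be illustrated with the textbook definitions of the two operators. The sketch below uses floating-point Python and is not the core's implementation: the core works in fixed point and, for MAX*, replaces the logarithmic correction term with a small lookup table whose contents are not given here. MAX SCALE is not sketched because the scale factor used by the core is not stated in this document.

```python
import math


def max_log(a, b):
    """MAX: approximate ln(e^a + e^b) by the larger argument."""
    return max(a, b)


def max_star(a, b):
    """MAX*: exact ln(e^a + e^b), written as max plus a correction term.

    The correction ln(1 + e^-|a-b|) is the non-linear part that a
    hardware implementation would typically read from a lookup table.
    """
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


# The correction is largest when the two metrics are equal and fades
# quickly as they move apart.
for delta in (0.0, 1.0, 4.0):
    print(delta, max_star(0.0, delta) - max_log(0.0, delta))
```

The correction term never exceeds ln 2, which is why a small table is sufficient and why dropping it (MAX) costs only a fraction of a dB.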
Sliding Window
A commonly used technique to reduce the resource requirements of the core is the use of a sliding window in the calculations. As the sliding window only stores a subset of the entire data set at any one time, the memory requirements are significantly reduced. Two sliding window sizes can be used with the core: 32 or 64.
Code Rates
The core operates with all the different code rates of the 3GPP2 specification and always assumes that rate 1/5 data is used as input. For different code rates, the appropriate parity bits in the sequence are replaced by zeros, allowing the core to implement any puncturing scheme.
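The zero-insertion step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the channel order (systematic, RSC1_P0, RSC1_P1, RSC2_P0, RSC2_P1) is assumed for the sketch, and the mask in the example is arbitrary rather than an actual 3GPP2 puncturing pattern.

```python
def depuncture(received, keep_masks):
    """Expand punctured soft values to rate 1/5 by inserting zeros.

    received   : flat sequence of soft values that were actually transmitted
    keep_masks : one 5-entry mask per input bit; a truthy entry means the
                 corresponding channel was transmitted (order is an
                 assumption: syst, RSC1_P0, RSC1_P1, RSC2_P0, RSC2_P1)
    """
    values = iter(received)
    expanded = []
    for mask in keep_masks:
        # A zero soft value carries no information about the bit, so the
        # decoder treats a punctured position as unknown.
        expanded.append([next(values) if kept else 0.0 for kept in mask])
    return expanded


# Illustrative mask only, not a 3GPP2 pattern.
masks = [(1, 1, 0, 0, 0), (1, 0, 0, 1, 0)]
print(depuncture([0.5, -1.0, 0.25, 0.75], masks))
```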
Input/Output Pins
Signal names are shown in Figure 1 and described in Table 1.
Figure 1 (image not reproduced): core pinout. The legend distinguishes bus and single-bit inputs/outputs. Pins recoverable from the figure: RFFD, RFD, RDY, DOUT, S_ADDR, P_ADDR, WR_D_OUT, WE, RD_D_IN.
Pin | Direction | Width | Description
FD_IN | Input | 1 | First Data. When asserted (High) on a valid rising clock edge, the decoding process is started. Qualified by ND when present.
ND | Input | 1 | New Data. When asserted (High) on a valid rising clock edge, a new input value is read from the DIN port. Ignored when RFD is deasserted.
ACLR | Input | 1 | Asynchronous Clear. When asserted (High), the decoder asynchronously resets.
SCLR | Input | 1 | Synchronous Clear. When asserted (High) on a valid rising clock edge, the decoder is reset.
CE | Input | 1 | Clock Enable. When this is deasserted (Low), rising clock edges are ignored and the core is held in its current state. A rising clock edge is only valid when CE is asserted (High).
CLK | Input | 1 | Clock. All synchronous operations occur on the rising edge of the clock signal.
ITERATIONS | Input | 4 | Iterations. The number of iterations that the core must implement. Read on a valid FD_IN assertion (High).
BLOCK_SIZE_SEL | Input | 4 | Block Size Select. The block size of the current decode operation. Read on a valid FD_IN assertion (High).
DIN | Input |  | Data Input. Consists of the systematic and parity data input.
RD_D_IN | Input |  | Read Data In. Only used with an external memory interface. The systematic and parity data are read into the core from external memory using this port.
RFFD | Output | 1 | Ready For First Data. When asserted (High), the core is ready to start another decoder operation.
RFD | Output | 1 | Ready For Data. When asserted (High), the core is ready to accept data on the DIN port.
RDY | Output | 1 | Ready. When asserted, the data on the DOUT port is valid. This indicates that the entire decode operation is complete.
S_ADDR | Output | 14 | Systematic Address. Only used with an external memory interface. External RAM address to control the reading and writing of the systematic data.
P_ADDR | Output |  | Parity Address. Only used with an external memory interface. External RAM address to control the reading and writing of the parity data.
DOUT | Output |  | Data Out. Decoded output data from the core. Qualified by RDY.
WR_D_OUT | Output |  | Write Data Out. Only used with an external memory interface. The systematic and parity data are written out to external memory from the core using this port.
WE | Output | 1 | Write Enable. Only used with an external memory interface. Indicates that the data on the WR_D_OUT port is valid and needs to be written to external memory.
Functional Description
Clock: CLK
All operations of the core are synchronized to the rising edge of the CLK signal. If the optional CE pin is present, a rising clock edge is only valid when CE is High; if CE is Low, the core is held in its current state.
Clock Enable: CE
CE is an optional pin used to indicate if the next rising clock edge is valid. When CE is High, rising clock edges are valid and allow the decoding process to continue. If CE is Low, the core operations are suspended and the core remains in its current state. All synchronous signals are ignored when CE is Low.
...iterations to be implemented for this decode operation. FD_IN should only be held High for a single valid clock cycle. If the optional ND pin is present, FD_IN is only valid when ND is High on the same valid clock edge.
New Data: ND
The ND signal is optional and is used to indicate that there is new input data to be read from the DIN port. For example, if ND is required and the input block size is 122, then 122+6 (tail bits) active High ND assertions are required to load in the complete block before the decoding operation commences. After all the expected input data has been read into the core, the ND signal is ignored until the next decoding block is started.
ITERATIONS
This 4-bit input port represents the number of iterations, with valid values from 0001 to 1111 (binary), or 1 to 15 (decimal). The ITERATIONS port is read when a valid FD_IN occurs. The value read defines the number of iterations to be implemented for that block's decode operation.
Block Size | BLOCK_SIZE_SEL (binary)
122 | 0000
250 | 0001
506 | 0010
762 | 0011
1,018 | 0100
1,530 | 0101
2,042 | 0110
3,066 | 0111
4,090 | 1000
5,114 | 1001
6,138 | 1010
8,186 | 1011
12,282 | 1100
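Because the select codes simply count up through the supported block sizes, the mapping can be held in a short lookup. The Python sketch below is a host-side helper for test benches or driver code; the names are hypothetical and not part of the core.

```python
# Supported block sizes in BLOCK_SIZE_SEL order (code 0000 first).
BLOCK_SIZES = (122, 250, 506, 762, 1018, 1530, 2042,
               3066, 4090, 5114, 6138, 8186, 12282)


def block_size_sel(block_size):
    """Return the 4-bit BLOCK_SIZE_SEL code, as a binary string."""
    if block_size not in BLOCK_SIZES:
        raise ValueError(f"unsupported block size: {block_size}")
    return format(BLOCK_SIZES.index(block_size), "04b")


def input_samples(block_size):
    """Number of input sets to load: the block plus 6 tail sets."""
    return block_size + 6


print(block_size_sel(3066), input_samples(122))
```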
Ready: RDY
This signal is asserted after completing the number of iterations defined by the ITERATIONS port on the valid FD_IN signal. RDY is asserted High to indicate that the data on the DOUT port is now valid and forms the final result of the decode operation. RDY is always asserted for a number of valid clock cycles equal to the size of the current block being decoded.
Write Enable: WE
This port is only used where an external memory interface is required. When the WE signal is High, there is valid data on the WR_D_OUT port that must be written to memory.
Turbo Encoder
The data into the DIN port of the Turbo Decoder core must be generated by a TCC Encoder core that provides the correct data format, such as the 3GPP2 Turbo Encoder v2.0 supplied by Xilinx. A brief description of the data output requirements of the Turbo Encoder core is provided here for the purpose of identifying the input requirements of the decoder core. Figure 3 shows the basic structure of the Turbo encoder. It consists of two identical Recursive Systematic Convolutional (RSC) encoders: one processes the original input data, and the other processes an interleaved version of the original input data. As a general rule, the original input is delayed by the latency of the interleaver, so that the first and successive outputs from both RSCs are synchronized and occur on the same clock cycle. The output from each RSC consists of the original input bit, or systematic bit, and two parity bits that are created by the circuit shown in Figure 4. Some of the RSC bits are not transmitted, depending on the selected encoding rate. For example, in rate 1/5, five output bits are generated for every input bit. Figure 4 shows that there is a control at the RSC input that switches between new data input and a feedback input. When block_size values have been output from the encoder, these control switches are switched over to create tail bits, which are used to force the RSCs to a known state. Each of the two RSCs creates three sets of soft values during the tail bit generation. The output of the Turbo Encoder (and the input to the Turbo Decoder) always consists of block_size+6 sets of soft values. See the 3GPP2 specification for more details.
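The tail-bit mechanism can be made concrete with a bit-level model of a single RSC. The sketch below is an illustration, not the Xilinx encoder: the generator polynomials (feedback 1 + D^2 + D^3, parity 0 as 1 + D + D^3, parity 1 as 1 + D + D^2 + D^3) are an assumption taken from common descriptions of the cdma2000 turbo code and should be checked against the 3GPP2 specification. The function name is hypothetical.

```python
def rsc_encode(bits):
    """Encode one block with a 3-register RSC and append 3 tail steps.

    Returns (outputs, final_state), where outputs is a list of
    (systematic, parity0, parity1) tuples of length len(bits) + 3.
    Polynomials are an assumption for this sketch (see lead-in).
    """
    s1 = s2 = s3 = 0
    outputs = []
    for step in range(len(bits) + 3):
        feedback = s2 ^ s3                 # taps D^2 and D^3
        if step < len(bits):
            u = bits[step]                 # normal data input
        else:
            u = feedback                   # tail: input switched to feedback
        a = u ^ feedback                   # value shifted into the register
        parity0 = a ^ s1 ^ s3              # 1 + D + D^3
        parity1 = a ^ s1 ^ s2 ^ s3         # 1 + D + D^2 + D^3
        outputs.append((u, parity0, parity1))
        s1, s2, s3 = a, s1, s2
    return outputs, (s1, s2, s3)


outputs, state = rsc_encode([1, 0, 1, 1, 0])
print(len(outputs), state)
```

During the tail the register input is forced to zero, so three steps are enough to clear a three-stage register whatever the data was. Each RSC contributes three tail sets, which is where the block_size+6 total comes from.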
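Figure 3 (image not reproduced): basic structure of the Turbo encoder. data_in feeds RSC1 through a Delay block and RSC2 through the Interleaver. The RSC1 outputs (RSC1_systematic, RSC1_parity0, RSC1_parity1) and the RSC2 outputs (RSC2_systematic, RSC2_parity0, RSC2_parity1) pass to the Puncture block.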
Figure 5: Data and Tail Bit Input Sequence into the Decoder Core (image not reproduced). The figure shows the fields of the DIN port on successive samples: RSC1_syst, RSC1_P0, RSC1_P1, RSC2_P0, and RSC2_P1 for data indices 120 and 121, followed by the RSC1 tail samples T0, T1, and T2, during which the RSC2 parity fields are marked as not used.
This is the natural logarithm of the ratio of the probabilities that a received symbol is a one or a zero. The actual required input of the Turbo Decoder core is LLR(x)/2. Knowledge of the input data format and the noise characteristics is required to calculate LLR(x) accurately. The Turbo Decoder still functions if an estimate of the LLR is made, but best performance is obtained with an accurate calculation of the LLR.

Assume that there is a Binary Phase Shift Keying (BPSK) input signal, x, where a value of 1.0 represents a transmitted logic 1, and -1.0 represents a transmitted logic 0. Also, assuming that the input has been corrupted by random Gaussian noise with a mean $\mu$ and a variance $\sigma^2$, the LLR(x) can be calculated from

$$\mathrm{LLR}(x) = \ln \frac{\dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)}{\dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\dfrac{(x+\mu)^2}{2\sigma^2}\right)}$$

This produces:

$$\mathrm{LLR}(x) = \frac{2\mu x}{\sigma^2}$$

meaning that the required input into the DIN port of the decoder is given by:

$$\mathrm{DIN} = \frac{\mathrm{LLR}(x)}{2} = \frac{\mu x}{\sigma^2}$$

For a mean of 1, the input to the DIN port of the decoder is simply the values output from the encoder divided by the variance. If, for example, the mean of the input signals is 3 instead of 1, the input values are simply divided by 3 so that they are rescaled to have a mean of 1. This ensures optimal operation of the Turbo Decoder.
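As a worked illustration of this scaling, the Python sketch below converts a received sample into the LLR(x)/2 value and then into a two's complement integer code. The function names are hypothetical. The quantizer assumes that the integer bit count includes the sign bit and that out-of-range values saturate; neither detail is stated in this document, so treat both as assumptions.

```python
def din_value(x, mean, variance):
    """LLR(x)/2 for a BPSK sample x with signal mean `mean` and noise variance."""
    return mean * x / variance


def quantize(value, int_bits, frac_bits):
    """Convert to a two's complement integer code with saturation.

    Assumption: int_bits includes the sign bit, so the word is
    int_bits + frac_bits wide in total.
    """
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(value * (1 << frac_bits))   # scale by 2^frac_bits
    return max(lowest, min(highest, code))


# A clean +1.0 sample with unit mean and variance 0.5 gives DIN = 2.0,
# which is code 8 in a 3.2 format (2.0 * 2^2).
print(din_value(1.0, 1.0, 0.5), quantize(2.0, 3, 2))
```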
1.8 dB, and 3 dB for rates 1/2, 1/3, and 1/4, respectively. For example, if the user requires the decoder to operate at different Eb/No values, the user can either calculate the variance value in real time or use Figure 6 to provide an estimate of the variance. For example, if the decoder is expected to operate when Eb/No is around 2 dB, then appropriate variance values are 0.63, 0.95, and 1.25 for rates 1/2, 1/3, and 1/4, respectively. For the purposes of this data sheet, when Bit Error Rate (BER) figures are quoted as using scaled values, it is assumed that the noise variance is accurately known. The less accurate the noise variance of the input, the greater the degradation in BER performance.
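Where the variance is calculated rather than read from Figure 6, one common approach uses the standard relation for unit-amplitude BPSK over an additive white Gaussian noise channel, variance = 1 / (2 x rate x Eb/No), with Eb/No as a linear ratio. The sketch below assumes that relation; it is not taken from this document, although it closely tracks the 2 dB figures quoted above. The function name is hypothetical.

```python
def noise_variance(ebno_db, code_rate):
    """Noise variance for unit-amplitude BPSK at a given Eb/No (dB) and code rate.

    Uses the textbook AWGN relation; this is an assumption, not a formula
    stated in this data sheet.
    """
    ebno_linear = 10 ** (ebno_db / 10)
    return 1.0 / (2.0 * code_rate * ebno_linear)


# Compare with the values quoted in the text for Eb/No of 2 dB.
for rate in (1 / 2, 1 / 3, 1 / 4):
    print(rate, round(noise_variance(2.0, rate), 2))
```

The computed values agree with the quoted 0.63 and 0.95 and come out slightly above the quoted 1.25 for rate 1/4, which is consistent with the quoted values being estimates read from a plot.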
Figure 6 (image not reproduced): noise variance as a function of Eb/No.
single bi-directional data port on the memories, rather than the separate ports shown in Figure 7. These simple changes are left for the user to implement, as they will be specific to the design. Note that the systematic memory is only 5 bits wide in this example, compared to the parity memory, which is 20 bits wide. The parity memory is always 4 times as wide as the systematic memory because there are four parity input channels and only one systematic channel. The maximum block size in the 3GPP2 specification is 12288 (including 6 tail bits), giving a total memory requirement for this example as follows:
Systematic memory requirement = 5 bits x 12288 = 61,440 bits
Parity memory requirement = 4 channels x 5 bits x 12288 = 245,760 bits
It is important to note that the systematic memory is read in a linear and an interleaved sequence. This memory block must therefore be capable of true random access at the full system clock rate. Parity memory is always addressed as an increasing count from zero to block size + tail bits.
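The memory sizing above generalizes to other input widths. The following Python sketch repeats the arithmetic; the function name and argument names are hypothetical, and the address width is derived here from the memory depth rather than quoted from this section.

```python
def external_memory_bits(sample_bits, depth=12288, parity_channels=4):
    """Return (systematic_bits, parity_bits, address_bits) for external RAM.

    sample_bits     : width of one soft value (5 in the example above)
    depth           : maximum block size including tail bits
    parity_channels : number of parity input channels
    """
    systematic_bits = sample_bits * depth
    parity_bits = parity_channels * sample_bits * depth
    # Smallest address width that can index every location 0 .. depth-1.
    address_bits = (depth - 1).bit_length()
    return systematic_bits, parity_bits, address_bits


print(external_memory_bits(5))
```

For 5-bit samples this reproduces the 61,440-bit and 245,760-bit figures and shows that 14 address bits are enough to cover 12288 locations.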
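Figure 7 (image not reproduced): external memory interface. The TCC_DECODER connects to the Systematic RAM through S_ADDR, WR_D_OUT[24:20], RD_D_IN[24:20], and WE, and to the Parity RAM through P_ADDR, WR_D_OUT[20:0], RD_D_IN[20:0], and WE. Each RAM has ADDR, READ_DATA, WRITE_DATA, and WE ports.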
Signal Timing
The Turbo Decoder core is a synchronous core operating on the rising edge of the clock. All input signals are read and all output signals can be changed on the rising edge of the clock. The only exception to this is the asynchronous clear signal, ACLR. When an optional CE signal is used, the core state does not change when CE is Low; all input signals are ignored and the core outputs remain the same. If the optional CE signal is not used, the core operates as though CE is permanently High (enabled).

Figure 8 shows the input timing for the decoder. The data input process is started when FD_IN, CE, and ND are all High on a rising clock edge. On receiving a valid FD_IN pulse, the RFFD signal goes Low to indicate that the core is no longer ready to receive a first data pulse. RFFD remains Low until the core is ready to process another block of data. The first input data, d0, is read from the DIN port on the same clock edge as the valid FD_IN pulse is detected (Figure 8). At the same time, the ITERATIONS and BLOCK_SIZE_SEL inputs are read to determine the size of block to be processed and the number of decode iterations to be implemented. The core will read the next input data values on successive rising edges of the clock unless CE or ND is Low, in which case the input data is ignored. During the data input process, the RFD signal remains High to indicate that the core is ready to accept further input data.
As shown in Figure 9, at the end of the input cycle, after BLOCK_SIZE+6 data values have been input, the RFD signal will go Low to indicate that all input data has been read. Once RFD is Low further ND signal changes are ignored. The RFD signal going Low also indicates that the core is moving from its input to its decoding phase.
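The input handshake described above can be approximated with a small cycle-based model, which is useful when sizing a test bench. The sketch below is a behavioral simplification, not a model of the core's internals: it only counts accepted samples, ignores ACLR and SCLR, and ignores FD_IN after the block has started. The function name is hypothetical.

```python
def last_input_cycle(block_size, fd_in, ce, nd):
    """Return the index of the clock edge on which the final sample is read.

    fd_in, ce, nd : per-clock-edge sequences of 0/1 values
    Returns None if the sequences end before block_size + 6 samples
    have been accepted.
    """
    needed = block_size + 6          # data plus tail sets
    accepted = 0
    started = False
    for cycle, (fd, enable, new_data) in enumerate(zip(fd_in, ce, nd)):
        if not (enable and new_data):
            continue                 # edge ignored: CE or ND is Low
        if not started:
            if not fd:
                continue             # still waiting for a valid FD_IN
            started = True           # d0 is read on this same edge
        accepted += 1
        if accepted == needed:
            return cycle             # RFD goes Low after this edge
    return None


cycles = 200
fd = [1] + [0] * (cycles - 1)
print(last_input_cycle(122, fd, [1] * cycles, [1] * cycles))
```

With CE and ND held High, a block of 122 finishes loading on edge 127 (the 128th edge); each edge on which ND or CE is Low pushes that point back by one.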
Figure 9 (image not reproduced): timing at the end of the input cycle.
After the decoder has performed the required number of iterations, the RDY signal is driven High to indicate that there is valid data on the DOUT port (Figure 10).
Figure 10 (image not reproduced): timing of the RDY and DOUT outputs.
When the core approaches the end of the data output phase, it takes RFD and RFFD High to indicate that the core is ready to accept a new block. These signals go High before the last data has been output to maximize throughput.
BER Performance
The effect of different block sizes, rates, and other parameters on the BER performance of the core has been measured and plotted against Eb/No. This allows the designer to determine the optimum trade-off between core performance and complexity. Table 3 indicates which parameter is varied in each of the BER Performance Plots shown in Figure 11 through Figure 15.
Table 3: BER Performance Plots
The core configuration, code rate, block size used, and number of iterations implemented for each trace in the BER plots are identified in the legend. For example:
1r3,2i3,6m3,scale,w32,bs122,i5
Where:
  1r3 = code rate 1/3
  2i3 = 2 input integer bits and 3 fractional input bits
  6m3 = 6 metric integer bits and 3 fractional metric bits
  scale = max scale algorithm (alternatively, star = max star algorithm)
  w32 = window size of 32 (alternatively, w64 = window size of 64)
  bs122 = block size of 122 excluding tail bits
  i5 = 5 iterations

These results have been generated in hardware using a Virtex-4 device. The device was configured with a setup consisting of an encoder, noise channel, and the decoder. Additional logic in the FPGA was used to record both the throughput and the bits in error between the input data to the encoder and the output data from the decoder. The input data shown in the plots has been scaled as described in the Data Input Format section.
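When plot data is post-processed, the legend strings can be split back into their fields. The Python sketch below follows the field order of the example legend; the function name and the dictionary keys are hypothetical.

```python
import re


def _pair(field, letter):
    """Parse '<int><letter><int>', for example '2i3' -> (2, 3)."""
    match = re.fullmatch(rf"(\d+){letter}(\d+)", field)
    if match is None:
        raise ValueError(f"unexpected legend field: {field!r}")
    return int(match.group(1)), int(match.group(2))


def _number(field, prefix):
    """Parse '<prefix><int>', for example 'bs122' -> 122."""
    match = re.fullmatch(rf"{prefix}(\d+)", field)
    if match is None:
        raise ValueError(f"unexpected legend field: {field!r}")
    return int(match.group(1))


def parse_legend(legend):
    """Split a BER plot legend string into named fields."""
    rate, inp, metric, algorithm, window, block, iters = legend.split(",")
    return {
        "code_rate": _pair(rate, "r"),        # (numerator, denominator)
        "input_bits": _pair(inp, "i"),        # (integer, fractional)
        "metric_bits": _pair(metric, "m"),    # (integer, fractional)
        "algorithm": algorithm,               # 'scale' or 'star'
        "window": _number(window, "w"),
        "block_size": _number(block, "bs"),   # excluding tail bits
        "iterations": _number(iters, "i"),
    }


print(parse_legend("1r3,2i3,6m3,scale,w32,bs122,i5"))
```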
Case | LUTs | Flip-Flops | Hardware Multipliers | Block RAMs
A | 2,985 | 1,617 | 4 | 27
B | 3,204 | 1,678 | 4 | 27
C | 3,474 | 1,743 | 4 | 27
D | 3,886 | 2,135 | 4 | 43
Notes: 1. Area and maximum clock frequencies are provided as a guide. They may vary with new releases of Xilinx implementation tools, etc. 2. Clock frequency does not take clock jitter into account and should be derated by an amount appropriate to the clock source jitter specification.
Table 5: Performance and Resource Requirements for Virtex-5 XC5VSX35 (speed grade -1)
Case | LUTs | Flip-Flops | Hardware Multipliers | Block RAMs
A | 2,496 | 1,541 | 4 | 15
B | 2,572 | 1,590 | 4 | 15
C | 3,060 | 1,658 | 4 | 15
D | 3,318 | 2,030 | 4 | 23
Notes: 1. Area and maximum clock frequencies are provided as a guide. They may vary with new releases of Xilinx implementation tools, etc. 2. Clock frequency does not take clock jitter into account and should be derated by an amount appropriate to the clock source jitter specification.
The core latency in clock cycles is given by the following formula:
Core Latency = (window_size+3) + 2*iterations*(2*window_size+19+block_size)
The core latency is defined as the number of clock cycles measured from the first input to the core (that is, when FD_IN is active) to the first decoded data output from the core (when RDY is asserted). RFFD is asserted before RDY is deasserted, indicating that another block's decode operation can be initiated before the previous block's data has been completely output.
The number of clock cycles between successive decode operations is given by the formula:
Clock cycles between blocks = Core Latency + block_size - 109 - 3*(window_size-32)
Given the number of clock cycles defined in the above equation and the maximum clock frequency of the core, it is possible to calculate the throughput achievable. Table 6 gives some typical values for throughput of the core for a selection of block sizes and numbers of iterations. These assume a window size of 32 and a performance equal to that of a Virtex-5 XC5VSX35 device implementing Case A (150 MHz from Table 5).
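The two formulas combine into a throughput estimate. The Python sketch below implements them as written and assumes that throughput is block_size decoded bits per inter-block period at the given clock frequency; that definition is an inference, although it reproduces the Table 6 figures. The function names are hypothetical.

```python
def core_latency(block_size, iterations, window_size=32):
    """Clock cycles from a valid FD_IN to the first decoded output."""
    return (window_size + 3) + 2 * iterations * (2 * window_size + 19 + block_size)


def cycles_between_blocks(block_size, iterations, window_size=32):
    """Clock cycles between successive decode operations."""
    return (core_latency(block_size, iterations, window_size)
            + block_size - 109 - 3 * (window_size - 32))


def throughput_mbps(block_size, iterations, clock_mhz, window_size=32):
    """Decoded Mbits/s, assuming one block per inter-block period."""
    return block_size * clock_mhz / cycles_between_blocks(
        block_size, iterations, window_size)


# Block size 122, 3 iterations, window 32, 150 MHz clock.
print(core_latency(122, 3), cycles_between_blocks(122, 3),
      round(throughput_mbps(122, 3, 150), 1))
```

For a block size of 122 with 3 iterations this gives a latency of 1,265 cycles, 1,278 cycles between blocks, and about 14.3 Mbits/s at 150 MHz.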
Table 6: Throughput Rates (Mbits/s)
Block Size | 3 Iterations | 5 Iterations | 7 Iterations | 9 Iterations | 11 Iterations
122 | 14.3 | 8.7 | 6.3 | 4.9 | 4.0
506 | 19.1 | 12.0 | 8.7 | 6.9 | 5.7
762 | 19.8 | 12.5 | 9.1 | 7.2 | 5.9
1,530 | 20.6 | 13.0 | 9.5 | 7.5 | 6.2
3,066 | 21.0 | 13.3 | 9.8 | 7.7 | 6.4
6,138 | 21.2 | 13.5 | 9.9 | 7.8 | 6.4
12,282 | 21.3 | 13.6 | 9.9 | 7.8 | 6.5
References
1. High Rate Packet Data Air Interface Specification Version 1.0, 3GPP2 C.S0024-B CDMA2000.
2. Near Shannon Limit Error-correcting Coding and Decoding: Turbo Codes, C. Berrou, A. Glavieux, and P. Thitimajshima, Proceedings of the IEEE International Conference on Communications, 1993, pp. 1064-1070.
3. Improving the MAX Log MAP Turbo Decoder, J. Vogt and A. Finger, Electronics Letters, 9th November 2000, Volume 36, No. 23, pp. 1937-1939.
Ordering Information
This Xilinx LogiCORE product is provided under the terms of the SignOnce IP Site License. For additional information about the core and how to obtain a license for the core, please see the TCC Decoder product page. For pricing and availability of Xilinx LogiCORE products and software, please contact your local Xilinx sales representative or visit the Xilinx Silicon Xpresso Cafe.

France Telecom, for itself and certain other parties, claims certain intellectual property rights covering Turbo Codes technology, and has decided to license these rights under a licensing program called the Turbo Codes Licensing Program. Supply of this IP core does not convey a license nor imply any right to use any Turbo Codes patents owned by France Telecom, TDF, or GET. Please contact France Telecom for information about its Turbo Codes Licensing Program at the following address:
France Telecom R&D, VAT/TURBOCODES
38, rue du Général Leclerc
92794 Issy Moulineaux Cedex 9
Revision History
The following table shows the revision history for this document.
Date | Version | Revision
12/11/03 | 1.0 | Initial Xilinx release.
04/22/04 | 1.1 | Added further performance data.
04/28/05 | 2.0 | Added support for Spartan-3E.
02/15/07 | 2.1 | Updated for version 2.1.