Performance Evaluation of Galois Field Arithmetic Operators For Optimizing Reed Solomon Codec
Performance Evaluation of Galois Field Arithmetic Operators For Optimizing Reed Solomon Codec
Performance Evaluation of Galois Field Arithmetic Operators For Optimizing Reed Solomon Codec
Petrus Mursanto
Faculty of Computer Science University of Indonesia Kampus UI, Depok 16424 Phone: +62-21-7863419 Email: [email protected]
Abstract-A series of experiments has been conducted to show that efficiency improvement in Galois Field (GF) operators does not directly correspond to the system performance at application level. The experiments were motivated by so many research works focusing on performance improvement of GF operators. Numerous variants of operators were formed based on various combination of operation types (multiplication, division, inverse, square), representation basis (Polynomial, Normal, Dual), and processing types (serial, parallel). Each of the variants has the most efficient form in either time (fastest) or space (smallest occupied area) when implemented in FPGA chips. In fact, GF operators are not utilized individually, rather integrated one to the others in implementing algorithms, mostly in error correction codes and cryptography applications. The experiments based on the implementation of Reed Solomon Encoder and Decoder RS(15,11) 4-bit using VHDL by means of two synthesis tools the Xilinx ISE 8.2i and the Altium ProChip Designer concludes that application performance mainly depends on the composition and distribution of the operators as well as their interaction and interconnection within the system architecture. Keyword: Galois Field, Reed Solomon, VHDL, FPGA.
II. PREVIOUS RESEARCH Similar to the ordinary algebra, GF algebra has a number of arithmetic operations, such as: addition, subtraction, multiplication, division, inversion, square and square root. Variants of GF arithmetic operators are characterized by: 1. operation types: multiplication, division, inversion, square or square root. 2. representation basis: standard/polynomial (PB), normal (NB) or dual (DB) 3. processing types: serial or parallel In digital circuit, GF addition and subtraction are simply implemented by exclusive-OR logic operation. The advance of digital technology has shifted performance metric from running time of software algorithm [9] to VLSI complexity, i.e. the number of components and their total delay [6]. A. Performance Improvement of GF Operators The first circuit structure of GF arithmetic was proposed by Berlekamp in 1982, i.e. polynomial and dual based multiplication [10]. Normal based multiplier was introduced firstly by Massey-Omura in 1986 [11], which is known afterward as MO multiplier. In 1988, Mastrovito proposed a more modular multiplier with higher regularity of the structure that suits systolic cells in VLSI [12]. However, speed, size and modularity of Mastrovito's multiplier depend much on the irreducible polynomial P(x) used to generate the field elements. By selecting the right P(x), parallel multiplication has at most 2m2-1 gates and occupies 55% of the space required for implementing Bartee and Schneider's algorithm [9]. In 1991, Mastrovito's dissertation reported an experimental investigation on multiplication using more than one representation basis [13]. It was concluded that PB multiplier is the most versatile form for the most arithmetic computational problems GF(2m) in VLSI. In addition, PB solution also posses conversion cost that can compensate the efficiency gained by the other representation basis [14] and occupies a half space of the one required by MO multiplier. Mapping problem for inter-basis conversion is the concern of Wu et al. [15] who introduced an efficient conversion method from PB to NB specifically for squaring. Furthermore, Sunar et al. proposed conversion matrixes for any form of generator polynomial [16].
I. INTRODUCTION Galois Field (GF) arithmetic plays an important role in modern communication system, particularly in two important aspects of information exchange, i.e. security and data correctness. GF is utilized in cryptography algorithm [1][2] and error correction codes (ECC) [3][4]. Performance of applications in these two fields is determined by the efficiency of GF arithmetic operators involved in the system [5]. There has been found in the literatures, research efforts in improving GF operators efficiency, i.e. multiplication [6], division [7] and inversion [8]. In fact, GF operators are not performing their functions individually and independently, rather they are parts of a functional integration at the system level. Is operator efficiency beneficial to the application level performance? This paper reports an experimental result of implementing Reed Solomon Encoder and Decoder (Codec) RS(15,11) 4-bit based on six variants of GF operator. The purpose of the experiment is to obtain an RS Codec configuration whose throughput is the most optimum. The RS Codec was implemented using VHDL by means of two synthesis tools: the Xilinx ISE 8.2i and the Altium ProChip Designer.
Several improvements of multiplication algorithm were also reported by Afanasyev [17] and similarly by Hasan et al. [18] that proposed a modification of the architecture by defining the irreducible polynomial as all-one polynomial (AOP). By applying the AOP, Hasan claimed the complexity of multiplication decreases by 50%. Meanwhile, Lee-Lim also reported performance improvement by applying circular dual basis (CDB) [19]. Lee's method is very efficient for trinomial with composite GF((2n)m) where m is primary relative over n, or gcd(m,n) = 1. However, defining certain form of irreducible polynomial is considered as limitation, inflexible and low reusability [20]. A comprehensive study on GF arithmetic was reported by Paar's dissertation [21], in which he proposed a decomposition algorithm from GF(2k) to GF((2n)m) where k = n.m, called composite field. In addition, Paar also explored inversion after the first algorithm introduced by Itoh-Tsujii in 1988 [22]. Further Paar's research reported composite field multiplication and inversion in GF(28) [23]. Composite field implementation in FGPA showed component saving by 25% and acceleration by 10% [24]. The composite field inversion requires 29% of AND and XOR gates compared to the standard one. Rudra [25] and Jutla [26] also developed a method for linear transformation of GF binary elements to composite field representation. Combination of serial and parallel processes were reported by Choi et al. [27] who introduced hybrid multiplier by forming irreducible polynomial xm + xn + 1 where n m/2. This hybrid multiplier has flexible structure to compromise space and time complexity and is proven having less complexity than Wu and Hasans multiplier [15]. Several other methods were proposed to support hardware implementation, such as Huang-Wu [28] that has systolic array architecture approach to ease the testing process. B. RS Codec Implementation RS Codec algorithm has been discussed extensively in term of software algorithm running times. There are two references mentioning hardware implementation of RS Codec, they are [29] and [30]. However, there is no analytical discussion from those references, but merely on experience sharing with very specific characteristics. Kamran Saleh managed to develop RS(15,11) 4-bit that has processing speed 692 Mbps [29]. Shayan et al. developed RS(16,12) 5-bit for the Advanced Train Control System (ATCS) radio data link [30]. The RS device has total delay (encode and decode) 1.32ms in 2 MHz system. The available RS Codec applications are merely commercial IP Core with specific interface and features, such as [31], [32] and [33]. Component compositions, operator types and algorithms involved are kept secret due to its proprietary.
III. MOTIVATION Previous research focused on efficiency improvement of GF operators to obtain better performance in term of speed or occupied space when implemented in digital circuits. However, literatures on GF-based implementation at system level are merely experience sharing with specific features without any analysis on the consequences of GF operator variants involved in the system. Based on the available literatures, the experiments were designed to answer the following research questions: Can performance improvement gained at operator level be obtained linearly at the application level? How can GF-based circuit optimization be achieved at application level? Will an application take benefit by employing all best variants of GF operators? IV. METODOLOGY A set of experiments was designed to examine whether the best variants of GF operator can be combined to construct the most efficient application. In other words, what configuration of GF operator variants can produce an application with the greatest throughput? Arithmetic operator types were taken as suggested by the algorithm of encoder and decoder for RS(15,11) 4-bit. The operators were implemented as the most efficient variant from each combination of two parameters: representation basis (PB, NB or DB) and processing structure (parallel or serial). For further reference in the next discussion, we use six variants of operators as shown in Table I.
TABLE I GF OPERATOR VARIANTS Varian 1 2 3 4 5 6 Structure Parallel Serial Parallel Serial Parallel Serial Basis Polynomial Normal Dual
The RS Codec is implemented into six versions, each of which was constructed based on the variants in Table I. They were implemented using structural VHDL and synthesized with Xilinx ISE 8.2i and Altium ProChip Designer. The synthesis process results in maximum combinational path or total delay that defines the maximum frequency possibly supplied to the system. Hence, the system throughput can be calculated based on data capacity proceeded per time unit, expressed in Mega Byte per second (MBps). Throughput is then used as an indicator of the system performance. An optimal configuration is defined as the one having biggest throughput among the six versions of the system.
V. GF OPERATOR ARCHITECTURES This section briefly discusses the six variants of GF operator, particularly the ones widely used in encoding and decoding process, i.e. the multiplication. Variant 1 of multiplier implements Mastrovito's circuit in [12]. Multiplication is proceeding partially in variant 2 by the circuit shown in Fig.1. Variant 3 and 4 are realized based on the NB multiplier in [13]. Multiplier variant 5 and 6 are implemented using parallel and serial structure in [13].
result reports a minimum period as shown in Table III. As consequences the throughput obtained by each version of RS Encoder are shown in Table IV.
PB 2,50 2,11
NB 3,80 3,69
DB 2,76 2,55
TABLE IV RS ENCODER THROUGHPUT IN MBPS Fig. 1. Serial PB Multiplier GF(2m). Structure Parallel Serial Tools Xilinx Altium Xilinx Altium PB 36,00 40,30 39,30 46,52 NB 33,95 38,39 25,90 26,60 DB 26,00 32,20 44,25 48,00
The implementation of six multiplier variants results in delays shown in Table II. It can be seen that the parallel multiplier in dual basis has the biggest combination delay. The best variant is the one having the smallest delay, i.e. polynomial based parallel multiplier or variant 1.
TABLE II DELAY OF GF(24) MULTIPLIER IN NS Structure Parallel Serial Tools Xilinx Altium Xilinx Altium PB 12,69 11,37 15,96 15,24 NB 13,49 11,94 15,18 16,26 DB 17,61 14,25 15,31 15,14
It is shown that RS Encoder has the biggest throughput by using multiplier variant 6, i.e. dual based serial structure. In fact Table II shows that the fastest multiplier is performed by variant 1, i.e. polynomial based parallel multiplier. VII. RS DECODER The main function of decoding process is to detect the existence of error and make correction of it (if any). The configuration of RS(15,11) allows the maximum number of error (t) exists in 2 symbols. The decoding process has main process as shown in Fig.3.
VI. RS ENCODER The Reed Solomon encoder circuit is shown in Fig.2. Based on the same structure, six versions of RS Encoder were implemented, each of which employs multiplier from the variants in Table I and their performance were compared each other. Addition in encoding process can easily be implemented using XOR, bit-serially as well as in parallel. Therefore, performance is totally affected by the delay of each multiplier type in the encoder. The RS(15,11) 4-bit encodes the information of 11 symbols @4-bit or in total 44 bits. Number of cycle required for the encoding process is 1 cycle initialization plus 11 cycles for parallel multiplication. The serial PB and NB require 1 (initialization) + 11 (1 init + 4 bit) = 56 cycles. Specific for serial DB, 1 (initialization) + 11 4 bit = 45 cycles. Synthesis
Fig. 3. RS Decoder.
The implementation and evaluation results of each processing blocks are discussed in the following sessions. A. Syndrome Calculator Syndrome Calculator is to detect the existence of error in the transmitted codewords. If the values of syndromes are all zero, then r(x) is the codewords c(x) or in other words there is no error. Error produces a set of syndromes with specific pattern. Number of syndrome to be calculated is 2t, hence in RS(15,11) there will be four syndromes (Si). Syndromes are calculated by serial entry i for 1 i 4 to the received polynomial r(x). Since the data is transmitted per symbol, hence the whole data should be received first before computing Si in parallel. The delay and throughput of syndrome calculator for the three representation basis are shown in Table V.
TABLE V DELAY AND THROUGHPUT OF FULLY PARALLEL SYNDROME CALCULATOR Tools Xilinx Altium Unit Delay (ns) Thrghpt (MBps) Delay (ns) Thrghpt (MBps) PB 15,29 32,69 13,70 36,50 NB 16,67 29,99 14,75 33,90 DB 15,14 32,20 12,25 40,80
and represented by four coefficients of the error locator polynomial = 4 4 = 16 bit. Table VIII shows the module performance with six variants of multiplier and square.
TABLE VIII ERROR LOCATOR & MAGNITUDE POLYNOMIAL Structure Parallel Serial Tool Xilinx Altium Xilinx Altium PB 157,60 175,90 160,00 189,50 NB 148,20 174,00 105,40 108,34 DB 113,58 140.35 181,02 196,50
D. Error Corrector Error correction is conducted by summing (XOR) Error Magnitude Polynomial with the received data. Delay for the three representation basis is the delay of XOR gates. In Xilinx: delay = 2,9 ns and throughput = 2,586 GBps, whereas in Altium: delay = 2,7 ns and throughput = 2,778 GBps. E. Optimal RS Decoder Practically the overall RS Decoder performance requires a uniform representation basis for the whole symbols. It has been learnt that the speed of inter-module sequential processes is defined by the Syndrome Calculator. The fastest error locator in DB is slowed down by the Syndrome Calculator performance. In summary, Table IX to Table XII show the best throughput of the module from each representation basis.
TABLE IX THE BEST SERIAL SYNDROME CALCULATOR Tool Xilinx Architecture Serial PB Parallel NB Serial DB Serial PB Parallel NB Serial DB Thrghpt (MBps) 40,00 34,74 45,26 47,37 39,26 49,10 clk 2,5 75 13,49 16 2,76 60 2,11 75 15,18 16 2,55 60
Syndrome computation can also be performed sequentially as the symbols are transmitted. The result of performance measurement is shown in Table VI.
TABLE VI THROUGHPUT SYMBOL-SERIAL SYNDROME CALCULATOR (MBPS) Structure Parallel Serial Tool Xilinx Altium Xilinx Altium PB 36,94 41,23 40,00 47,37 NB 34,74 39,26 26,35 27,10 DB 26,62 32,89 45,26 49,10
Altium
B. Key Equation Solver Key Equation Solver (KES) is the most complex module in RS decoding process. KES is implemented according Berlekamp-Massey algorithm in [34]. The experiment on the six variants of operator within KES circuits results in performance figure shown in Table VII.
TABLE VII KES THROUGHPUT IN MBPS Structure Parallel Serial Tool Xilinx Altium Xilinx Altium PB 147,75 164.90 150,00 177,60 NB 138,95 160,00 98,80 101,57 DB 106,49 131,58 169,71 184,20
To combine the four modules into one integrated system, cycle rate must conform to the slowest component. Hence, the best configurations from each representation basis are shown in Table XIII. It is shown that both synthesis tools produce the best RS Decoder which has serial operation and GF element representation in dual basis. In fact, Table II shows that the best multiplier is in parallel PB.
TABLE X THE BEST KEY EQUATION SOLVER Tool Xilinx Architecture Serial PB Parallel NB Serial DB Serial PB Parallel NB Serial DB Thrghpt (MBps) 150,00 138,95 169,49 177,60 157,00 184,20 clk 2,5 20 13,49 4 2,76 16 2,11 20 11,94 4 2,55 16
C. Error Locator & Magnitude Polynomial Error locator and magnitude are calculated by the circuit described in [34]. Throughput is calculated with max error of 4
Altium
TABLE XI THE BEST ERROR LOCATOR & MAGNITUDE POLYNOMIAL Tool Xilinx Architecture Serial PB Parallel NB Serial DB Serial PB Parallel NB Serial DB Thrghpt (MBps) 160 148,20 181,02 189,5 174 196,50 clk 2,5 5 13,49 1 2,76 4 2,11 5 11,94 1 2,55 4
Altium
TABLE XII THE BEST ERROR CORRECTOR Tool Xilinx Basis PB NB DB PB NB DB Thrghpt (MBps) 2586 2586 2586 2778 2778 2778 TABLE XIII OPTIMAL DECODER Tool Xilinx Basis PB NB DB PB NB DB /clk 2,5 13,49 2,76 2,11 11,94 2,55 #cycle 75+20+5+1 16+4+1+1 60+16+4+1 75+20+5+1 16+4+1+1 60+16+4+1 (ns) 252,5 296,87 223,72 213,2 262,68 206,15 Thrghpt (MBps) 29,70 25,26 33,52 35,18 28,55 36,38 clk 2,9 1 2,9 1 2,9 1 2,7 1 2,7 1 2,7 1
Altium
It is examined that optimization was not applied to parallel multiplications although they also have constant operands. It is due to the fact that the tools optimize constant bit only, whereas parallel multiplier signal is implemented as an m-bit bus. DB multiplier is superior over the PB and NB variants in term of fully serial delivery of the product. With constant values of P(x) and A(x), DB serial multiplier delivers the product bit by bit as the operand bi enters from MSB to LSB. Therefore, variant 6 saves 1 cycle compared to variant 2 and 4 that deliver the product after the whole cycle for m-bit completes. For that reason, variant 6 based systems can proceed addition serially as the product is delivered from index m-1 to 0. With constant P(x) = x4 + x + 1, optimization steps applied to serial DB multiplication is shown by Fig.5. In general this finding supports several statements in the literature that multiplication with constant operands can be more efficient [10]. It was examined as well that optimization does not apply to serial NB multiplier. In serial NB multiplication, the values of both operands change dynamically during the lifecycle of the system due to the rotation of internal registers. IX. CONCLUSION This paper reports that an optimal performance of RS Codec does not always require the best variants or the most efficient GF operators. Combining all operators, each of which is the most efficient variant, is not a simple mechanistic conversion process. Obtaining synergic efficiency at system level requires careful considerations on several factors such as: the operator composition and distribution, interaction between them and types of internal process within the system. In addition, when implemented in FPGA, efficiency improvement is also contributed by the synthesis tool that optimizes serial operators whose operands are constant. This explains why the experimental results show the superiority of serial based system over parallel ones.
Altium
VIII. OPTIMIZATION BY THE SYNTHESIS TOOL Experiment result shows that employing the best operators does not automatically produce the most efficient applications. This phenomenon is shown consistently by Xilinx and Altium synthesis tools. Simple and common logic saying that parallel process is faster than serial does not hold. This is caused by the fact that the synthesis tool does some improvement or limited optimization to the VHDL structural design. Multiplication is obviously the dominant process in RS Codec. Examination on the results shows that multiplication with fixed bit operands is optimized by removing unnecessary components in the system. For specific configuration of RS Codes, P(x) and g(x) are fixed during the system lifecycle. g(x) is one of the operand performing as a(x) in Fig.1. Fixed values of pi and gi cause the content of block Ei that was originally shown as in Fig.4a can be simplified becoming the circuits shown in Fig.4b-c-d-e. All combination of ai and pi results in internal block Ei without any AND gate. By omitting AND gates, Xilinx saves significantly the delay so that the minimum period is 2,5 ns. Hence, processing 4 bit requires 10 ns, which is smaller than parallel multiplication delay 12,69 ns.
[12] [13] [14] [15] [16] [17] [18] Fig. 5. DB Serial Multiplier a) original b) optimized with constant P(x). [19] [20] [21] [22] [23] [24] [25]
It is interesting to examine further the consistency of this optimization in other GF based applications, such as cryptography and other configurations of RS Codec. Explorative experiments are required for real symbol width, m = 8 bit such as the one used by NASA RS(255,223) 8-bit [35] and Rijndael AES with cipher-key 8-bit [36]. Hypothetical prediction suggests that higher performance ratio would be obtained by serial variants over the parallel ones. It is because of additional combinational path in parallel operators that results in bigger delays. Meanwhile, serial operators require only several additional cycles with constant minimum periods. ACKNOWLEDGMENT This material is based upon work supported by European Union Project VN/Asia-Link/001 (79754). The author would like to thank Prof. R.G. Spallek and Mr. Jrg Schneider for providing tools and facilities in the Professur fr VLSIEntwurfssysteme, Diagnostik und Architektur, Technische Universitt Dresden, Germany. REFERENCES
[1] [2] [3] [4] vanTilborg, H. (1988) An Introduction to Cryptology, Kluwer Academic Publishers. Schneier, B. (1993) Applied Chryptography, Wiley & Sons. B.Wicker, S. (1995) Error Control Systems for Digital Communication and Storage, Prentice Hall, Upper Saddle River, New Jersey 07458. Sweeney, P. (2002) Error Control Coding from Theory to Practice, John Wiley & Sons, Inc., England.
[26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36]
Lin, S. and Costello, D. J. (1983) Error Control Coding, Prentice Hall, Englewood Cliffs, New Jersey. Wang, C., Truong, T., Shao, H., Deutsch, L., Omura, J., and Reed, I. August 1985 IEEE Transactions on Computers 34(8), 709717. Hasan, M., Wang, M., and Bhargava, V. August 1992 IEEE Transactions on Computers 41(8), 972980. Zhou, T., Wu, X., Bai, G., and Chen, H. 4 July 2002 Electronics Letters 38(14), 706707. Bartee, T. and Schneider, D. March 1963 Information and Computers 6, 79 98. Berlekamp, E. November 1982 IEEE Transactions on Information Theory 28, 869 874. Massey, J. and Omura, J. Computational method and apparatus for finite field arithmetic Technical report US Patent No. 4,587,627 to OMNET Association Sunnyvale CA, Washington, D.C. (1986) Patent and Trademark Office. Mastrovito, E. D. VLSI designs for computations over finite fields GF(2m) Masters thesis Linkping Studies in Science and Technology Linkping, Sweden December 1988 Thesis No: 159. Mastrovito, E. VLSI Architectures for Computations in Galois Fields PhD thesis Department of Electrical Engineering, Linkping University, Sweden (1991). Gollman, D. Algorithmenentwurf in der kryptographie Habil. University of Karlsruhe (1990) Preprint. Wu, H., Hasan, M. A., F.Blake, I., and Gao, S. 23 August 2001. Sunar, B., Savas, E., and Koc, C . K. November 2003 IEEE Transactions on Computers 52(11), 13911398. Afanasyev, V. (1990) In Proceedings of II International Workshop on Algebraic and Combinatorial Coding Theory Leningrad, USSA: pp. 68. Hasan, M., Wang, M., and Bhargava, V. October 1993 IEEE Transactions on Computers 42(10), 1278 1280. Lee, C.-H. and Lim, J.-I. A new aspect of dual basis for efficient field arithmetic Technical report Samsung Advaced Institute of Technology (1998). Fenn, T. S., Benaissa, M., and Taylor, D. March 1996 IEEE Transactions on Computers 45(3), 319 327. Paar, C. Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields PhD thesis Institute for Experimental Mathematics, University of Essen Essen, Germany June 1994. Itoh, T. and Tsujii, S. (1988) Information and Computing 78, 171177. Paar, C. and Rosner, M. April 1997 IEEE Symposium on FPGA-Based Custom Computing Machines (FCCM) pp. 219225. Mursanto, P. 29-30 January 2007 In National Conference on Computer Science & Information Technology Depok, Fasilkom UI. Rudra, A., Dubey, P. K., Jutla, C. S., Kumar, V., Rao, J., and Rohatgi, P. (2001) volume 2162, of Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems London, UK: Springer-Verlag. pp. 171184. Jutla, C., Kumar, V., and Rudra, A. On the circuit complexity of isomorphic Galois Field transformations Technical Report RC22652 (W0211-243) IBM Research Division November 2002. Choi, Y., Chang, K.-Y., Hong, D., and Cho, H. July 8 2004 Electronics Letters 40(Issue 14), 852 853. Huang, C.-T. and Wu, C.-W. September 2000 IEEE Transactions on Circuit and Systems 48(9), 909918. Saleh, K. Hardware architectures for Reed-Solomon decoders Technical report Amirkabir University of Technology, Iran January 2003. Shayan, Y., Ngoc, T., and Bhargava, V. November 1990 IEEE Transc. on Vehicular Technology 39(4), 400409. ASICSws Reed Solomon encoder IP core, http://www.asics.ws/doc/rse brief.pdf Nov 2006. Xilinx, I. Reed-Solomon solutions with Spartan-ii FPGAs Technical report Xilinx, Inc. Feb. 2000 White Paper. Reed Solomon coding Technical report VOCAL Technologies Ltd. 90A John Muir Drive, Buffalo - New York 14228 (2006). Sarwate, D. and Shanbhag, N. October 2001 IEEE Transactions on VLSI Systems 9(5), 641655. Sklar, B. (2003) Digital Communications Fundamentals and Applications, Prentice Hall, Inc., New Jersey 2nd ed. edition. Daemen, J. and Rijmen, V. March 2001 Dr.Dobbs Journal 26, 137139.